Abstract

The problem of maximizing precision at the top of a ranked list, often dubbed Precision@k (prec@k), finds
relevance in myriad learning applications such as ranking, multi-label classification, and learning with severe label 
imbalance. However, despite its popularity, there exist significant gaps in our understanding of this problem and its 
associated performance measure. 

The most notable of these is the lack of a convex upper bounding surrogate for prec@k. We also lack scalable 
perceptron and stochastic gradient descent algorithms for optimizing this performance measure. In this paper we 
make key contributions in these directions. At the heart of our results is a family of truly upper bounding surrogates 
for prec@k. These surrogates are motivated in a principled manner and enjoy attractive properties such as consistency 
to prec@k under various natural margin/noise conditions. 

These surrogates are then used to design a class of novel perceptron algorithms for optimizing prec@k with 
provable mistake bounds. We also devise scalable stochastic gradient descent style methods for this problem with 
provable convergence bounds. Our proofs rely on novel uniform convergence bounds which require an in-depth 
analysis of the structural properties of prec@k and its surrogates. We conclude with experimental results comparing 
our algorithms with state-of-the-art cutting plane and stochastic gradient algorithms for maximizing prec@k. 


1 Introduction 


Ranking a given set of points or labels according to their relevance forms the core of several real-life learning systems. 
For instance, in classification problems with a rare-class as is the case in spam/anomaly detection, the goal is to rank 
the given emails/events according to their likelihood of being from the rare-class (spam/anomaly). Similarly, in multi-label
classification problems, the goal is to rank the labels according to their likelihood of being relevant to a data point
[Tsoumakas and Katakis, 2007].

The ranking of items at the top is of utmost importance in these applications and several performance measures,
such as Precision@k, Average Precision and NDCG, have been designed to promote accuracy at the top of ranked lists. Of
these, the Precision@k (prec@k) measure is especially popular in a variety of domains. Informally, prec@k counts
the number of relevant items in the top-k positions of a ranked list and is widely used in domains such as binary
classification [Joachims, 2005], multi-label classification [Prabhu and Varma, 2014] and ranking [Le and Smola, 2007].

Given its popularity, prec@k has received attention from algorithmic, as well as learning theoretic perspectives. 
However, there remain specific deficiencies in our understanding of this performance measure. In fact, to the best of 
our knowledge, there is only one known convex surrogate function for prec@k, namely, the struct-SVM surrogate due 
to Joachims [2005] which, as we reveal in this work, is not an upper bound on prec@k in general, and need not recover
an optimal ranking even in strictly separable settings. 

Our aim in this paper is to develop efficient algorithms for optimizing prec@k for ranking problems with binary
relevance levels. Since the intractability of binary classification in the agnostic setting [Guruswami and Raghavendra,
2009] extends to prec@k, our goal is to exploit natural notions of benign-ness usually observed in natural
distributions to overcome such intractability results.


* Work done while H.N. was an intern at Microsoft Research India, Bangalore. 
1.1 Our Contributions


We make several contributions in this paper that both give deeper insight into the prec@k performance measure and
provide scalable techniques for optimizing it.

Precision@k margin: motivated by the success of margin-based frameworks in classification settings, we develop
a family of margin conditions appropriate for the prec@k problem. Recall that the prec@k performance measure
counts the number of relevant items at the top k positions of a ranked list. The simplest of our margin notions, which
we call the weak $(k, \gamma)$-margin, is said to be present if a privileged set of k relevant items can be separated from all
irrelevant items by a margin of $\gamma$. This is the least restrictive margin condition that allows for a perfect ranking w.r.t.
prec@k. Notably, it is much less restrictive than the binary classification notion of margin, which requires all relevant
items to be separable from all irrelevant items by a certain margin. We also propose two other notions of margin suited
to our perceptron algorithms.

Surrogate functions for prec@k: we design a family of three novel surrogates for the prec@k performance 
measure. Our surrogates satisfy two key properties. Firstly they always upper bound the prec@k performance measure 
so that optimizing them promotes better performance w.r.t prec@k. Secondly, these surrogates satisfy conditional 
consistency in that they are consistent w.r.t. prec@k under some noise condition. We show that there exists a one-one 
relationship between the three prec@k margin conditions mentioned earlier and these three surrogates so that each 
surrogate is consistent w.r.t. prec@k under one of the margin conditions. Moreover, our discussion reveals that the 
three surrogates, as well as the three margin conditions, lie in a concise hierarchy. 

Perceptron and SGD algorithms: using insights gained from the previous analyses, we design two perceptron- 
style algorithms for optimizing prec@k. Our algorithms can be shown to be a natural extension of the classical 
perceptron algorithm for binary classification [Rosenblatt, 1958]. Indeed, akin to the classical perceptron, both our
algorithms enjoy mistake bounds that reduce to crisp convergence bounds under the margin conditions mentioned 
earlier. We also design a mini-batch-style stochastic gradient descent algorithm for optimizing prec@k. 

Learning theory: in order to prove convergence bounds for the SGD algorithm, and online-to-batch conversion 
bounds for our perceptron algorithms, we further study prec@k and its surrogates and prove uniform convergence 
bounds for the same. These are novel results and require an in-depth analysis into the involved structure of the 
prec@k performance measure and its surrogates. However, with these results in hand, we are able to establish crisp 
convergence bounds for the SGD algorithm, as well as generalization bounds for our perceptron algorithms. 

Paper Organization: Section 2 presents the problem formulation and sets up the notation. Section 3 introduces
three novel surrogates and margin conditions for prec@k and reveals the interplay between these with respect to
consistency to prec@k. Section 4 presents two perceptron algorithms for prec@k and their mistake bounds, as well
as a mini-batch SGD-based algorithm. Section 5 discusses uniform convergence bounds for our surrogates and their
application to convergence and online-to-batch conversion bounds for the perceptron and SGD-style algorithms.
We conclude with empirical results in Section 6.


1.2 Related Work 


There has been much work in the last decade in designing algorithms for bipartite ranking problems. While the earlier
methods for this problem, such as RankSVM, focused on optimizing pair-wise ranking accuracy [Herbrich et al., 2000,
Joachims, 2002, Freund et al., 2003, Burges et al., 2005], of late, there has been enormous interest in performance
measures that promote good ranking performance at the top portion of the ranked list, and in ranking methods that
directly optimize these measures [Clémençon and Vayatis, 2007, Rudin, 2009, Agarwal, 2011, Boyd et al., 2012,
Narasimhan and Agarwal, 2013a,b, Li et al., 2014].


In this work, we focus on one such evaluation measure - Precision@k, which is widely used in practice. The only
prior algorithms that we are aware of that directly optimize this performance measure are a structural SVM based
cutting plane method due to Joachims [2005], and an efficient stochastic implementation of the same due to Kar et al.
[2014]. However, as pointed out earlier, the convex surrogate used in these methods is not well-suited for prec@k.

It is also important to note that the bipartite ranking setting considered in this work is different from other popular
forms of ranking such as subset or list-wise ranking settings, which arise in several information retrieval applications,
where again there has been much work in optimizing performance measures that emphasize accuracy at the top (e.g.
NDCG) [Valizadegan et al., 2009, Cao et al., 2007, Yue et al., 2007, Le and Smola, 2007, Chakrabarti et al., 2008,
Yun et al., 2014]. There has also been some recent work on perceptron style ranking methods for list-wise ranking
problems [Chaudhuri and Tewari, 2014], but these methods are tailored to optimize the NDCG and MAP measures,
which are different from the prec@k measure that we consider here. Other less related works include online ranking
algorithms for optimizing ranking measures in an adversarial setting with limited feedback [Chaudhuri and Tewari,
2015].


2 Problem Formulation and Notation 

We will be presented with a set of labeled points $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathcal{X}$ and $y_i \in \{0,1\}$. We shall
use $X$ to denote the entire dataset, $X_+$ and $X_-$ to denote the sets of positively and negatively (null) labeled points, and
$y \in \{0,1\}^n$ to denote the label vector. $z = (x, y)$ shall denote a labeled data point. Our results readily extend to
multi-label and ranking settings but for sake of simplicity, we focus only on bipartite ranking problems, where the goal
is to rank (a subset of) positive examples above the negative ones.

Given n labeled data points $z_1, \ldots, z_n$ and a scoring function $s : \mathcal{X} \to \mathbb{R}$, let $\sigma_s \in S_n$ be the permutation that
sorts points according to the scores given by s, i.e. $s(x_{\sigma_s(i)}) \ge s(x_{\sigma_s(j)})$ for $i \le j$. The Precision@k measure for this
scoring function can then be expressed as:

$$\text{prec@}k(s; z_1, \ldots, z_n) = \sum_{i=1}^{k} \left(1 - y_{\sigma_s(i)}\right). \qquad (1)$$

Note that the above is a "loss" version of the performance measure which penalizes any top-k ranked data points that
have a null label. For simplicity, we will use the abbreviated notation prec@k(s) := prec@k(s; $z_1, \ldots, z_n$). We will
also use the shorthand $s_i = s(x_i)$. For any label vectors $y', y'' \in \{0,1\}^n$, we define

$$\Delta(y', y'') = \sum_{i=1}^{n} (1 - y'_i)\, y''_i, \qquad K(y', y'') = \sum_{i=1}^{n} y'_i\, y''_i. \qquad (2)$$

Let $n_+(y') = K(y', y') = \|y'\|_1$ denote the number of positives in the label vector $y'$ and $n_+ = n_+(y)$ denote
the number of actual positives. Let $y^{(s,k)}$ be the label vector that assigns the label 1 only to the top k ranked items
according to the scoring function s. That is, $y^{(s,k)}_i = 1$ if $\sigma_s^{-1}(i) \le k$ and 0 otherwise. It is easy to verify that for
any scoring function s, $\Delta(y, y^{(s,k)}) = \text{prec@}k(s)$.
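The notation above can be made concrete with a few lines of Python. This is a minimal sketch, not the authors' code: the function names (`topk_indicator`, `delta`, `overlap`, `prec_at_k_loss`) and the tie-breaking rule (ties broken by index) are assumptions introduced here for illustration.

```python
def topk_indicator(scores, k):
    """Return y^(s,k): 1 for the k highest-scoring items, 0 elsewhere."""
    # Sort indices by decreasing score; ties broken by index (an assumption).
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    top = set(order[:k])
    return [1 if i in top else 0 for i in range(len(scores))]


def delta(y1, y2):
    """Delta(y', y'') = sum_i (1 - y'_i) * y''_i."""
    return sum((1 - a) * b for a, b in zip(y1, y2))


def overlap(y1, y2):
    """K(y', y'') = sum_i y'_i * y''_i."""
    return sum(a * b for a, b in zip(y1, y2))


def prec_at_k_loss(scores, y, k):
    """Loss form of prec@k: number of null-labeled items among the top k."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sum(1 - y[i] for i in order[:k])


y = [1, 0, 1, 0, 1]
s = [0.9, 0.8, 0.3, 0.7, 0.1]
y_hat = topk_indicator(s, 3)   # marks the items with scores 0.9, 0.8, 0.7
```

On this toy input two of the three top-ranked items are negatives, so both `prec_at_k_loss(s, y, 3)` and `delta(y, y_hat)` evaluate to 2, matching the identity stated at the end of the paragraph.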


3 A Family of Novel Surrogates for prec@k 


As prec@k is a non-convex loss function that is hard to optimize directly, it is natural to seek surrogate functions that
act as a good proxy for prec@k. There are two properties that we desire of such a surrogate:

1. Upper Bounding Property: the surrogate should upper bound the prec@k loss function, so that minimizing
the surrogate promotes small prec@k loss.

2. Conditional Consistency: under some regularity assumptions, optimizing the surrogate should yield an optimal
solution for prec@k as well.

Motivated by the above requirements, we develop a family of surrogates which upper bound the prec@k loss function
and are consistent to it under certain margin/noise conditions. We note that the results of Calauzènes et al. [2012] that
negate the possibility of consistent convex surrogates for ranking performance measures do not apply to our results
since they are neither stated for prec@k, nor do they negate the possibility of conditional consistency.

It is notable that the seminal work of Joachims [2005] did propose a convex surrogate for prec@k, that we refer to
as $\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(\cdot)$. However, as the discussion below shows, this surrogate is not even an upper bound on prec@k, let alone
consistent to it. Understanding the reasons for the failure of this surrogate is crucial in designing our own.
3.1 The Curious Case of $\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(\cdot)$


The $\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(\cdot)$ surrogate is a part of a broad class of surrogates called struct-SVM surrogates that are designed for
structured output prediction problems that can have exponentially large output spaces [Joachims, 2005]. Given a set of
n labeled data points, $\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(\cdot)$ is defined as

$$\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(s) = \max_{\substack{\hat{y} \in \{0,1\}^n \\ \|\hat{y}\|_1 = k}} \left\{ \Delta(y, \hat{y}) + \sum_{i=1}^{n} (\hat{y}_i - y_i)\, s_i \right\}. \qquad (3)$$

The above surrogate penalizes a scoring function if there exists a set of k points with large scores (i.e. the second
term is large) which are actually negatives (i.e. the first term is large). However, since the candidate labeling $\hat{y}$ is
restricted to labeling just k points as positive whereas the true label vector y has $n_+$ positives, in cases where $n_+ > k$,
a non-optimal candidate labeling $\hat{y}$ can exploit the remaining $n_+ - k$ labels to hide the high scoring negative points,
thus confusing the surrogate function. This indicates that this surrogate may not be an upper bound to prec@k. We
refer the reader to Appendix A for an explicit example where, not only does this surrogate not upper bound prec@k,
but more importantly, minimizing $\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(\cdot)$ does not produce a model that is optimal for prec@k, even in separable
settings where all positive points are separated from negatives by a margin.
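The failure of the upper bounding property can be seen on a very small instance. The sketch below evaluates equation (3) by brute force over all candidate labelings; the three-point dataset is an illustrative example constructed here (it is not the example from Appendix A), and the helper names are hypothetical.

```python
from itertools import combinations


def struct_surrogate(scores, y, k):
    """Brute-force evaluation of the struct-SVM surrogate of equation (3)."""
    n = len(y)
    best = float("-inf")
    for chosen in combinations(range(n), k):
        y_hat = [1 if i in chosen else 0 for i in range(n)]
        d = sum((1 - y[i]) * y_hat[i] for i in range(n))            # Delta(y, y_hat)
        lin = sum((y_hat[i] - y[i]) * scores[i] for i in range(n))  # sum (y_hat_i - y_i) s_i
        best = max(best, d + lin)
    return best


def prec_at_k_loss(scores, y, k):
    """Number of null-labeled items among the k highest-scoring items."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sum(1 - y[i] for i in order[:k])


# Two positives and one negative, with the negative ranked first; here n_+ = 2 > k = 1.
y = [1, 1, 0]
s = [0.5, 0.5, 0.75]
loss = prec_at_k_loss(s, y, 1)
surr = struct_surrogate(s, y, 1)
```

Here the prec@1 loss is 1 because the negative occupies the top position, yet the surrogate evaluates to 0.75: the scores of the two positives, both counted with a negative sign, pull the surrogate below the loss it is meant to bound.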

In the sequel, we shall propose three surrogates, all of which are consistent with prec@k under various noise/margin 
conditions. The surrogates, as well as the noise conditions, will be shown to form a hierarchy. 


3.2 The Ramp Surrogate $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(\cdot)$

The key to maximizing prec@k in a bipartite ranking setting is to select a subset of k relevant items and rank them at 
the top k positions. This can happen iff the top ranked k relevant items are not outranked by any irrelevant item. Thus, 
a surrogate must penalize a scoring function that assigns scores to irrelevant items that are higher than those of the top 
ranked relevant items. Our ramp surrogate implicitly encodes this strategy: 


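$$\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(s) = \max_{\|\hat{y}\|_1 = k} \left\{ \Delta(y, \hat{y}) + \sum_{i=1}^{n} \hat{y}_i s_i \right\} - \underbrace{\max_{\substack{\|\tilde{y}\|_1 = k \\ K(y, \tilde{y}) = k}} \sum_{i=1}^{n} \tilde{y}_i s_i}_{(P)}. \qquad (4)$$

The term (P) contains the sum of scores of the k highest scoring positives. Note that $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(\cdot)$ is similar to the "ramp"
losses for binary classification [Do et al., 2008]. We now show that $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(\cdot)$ is indeed an upper bounding surrogate
for prec@k.

Claim 1. For any $k \le n_+$ and scoring function s, we have $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(s) \ge \text{prec@}k(s)$. Moreover, if $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(s) \le \xi$ for
a given scoring function s, then there necessarily exists a set $S \subset [n]$ of size at most k such that for all $\|\hat{y}\|_1 = k$, we
have $\sum_{i \in S} s_i \ge \sum_{i=1}^{n} \hat{y}_i s_i + \Delta(y, \hat{y}) - \xi$.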

Proofs for this section are deferred to Appendix B. We can show that this surrogate is conditionally consistent as
well. To do so, we introduce the notion of weak $(k, \gamma)$-margin.

Definition 2 (Weak $(k, \gamma)$-margin). A set of n labeled data points satisfies the weak $(k, \gamma)$-margin condition if for
some scoring function s and set $S_+ \subseteq X_+$ of size k,

$$\min_{i \in S_+} s_i - \max_{j : y_j = 0} s_j \ge \gamma.$$

Moreover, we say that the function s realizes this margin. We abbreviate the weak $(k, 1)$-margin condition as simply
the weak k-margin condition.
Clearly, a dataset has a weak $(k, \gamma)$-margin iff there exist some k positive points that substantially outrank all
negatives. Note that this notion of margin is strictly weaker than the standard notion of margin for binary classification
as it allows all but those k positives to be completely mingled with the negatives. Moreover, this seems to be one of
the most natural notions of margin for prec@k. The following claim establishes that $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(\cdot)$ is indeed consistent
w.r.t. prec@k under the weak k-margin condition.

Claim 3. For any scoring function s that realizes the weak k-margin over a dataset, $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(s) = \text{prec@}k(s) = 0$.

This suggests that $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(\cdot)$ is not only a tight surrogate, but is tight at the optimal scoring function, i.e. one with
prec@k(s) = 0; this, along with the upper bounding property, implies consistency. However, it is also a non-convex
function due to the term (P). To obtain convex surrogates, we perform relaxations on this term by first rewriting it as follows:

$$(P) = \sum_{i=1}^{n} y_i s_i - \underbrace{\min_{\substack{\tilde{y} \preceq y \\ \|\tilde{y}\|_1 = n_+ - k}} \sum_{i=1}^{n} \tilde{y}_i s_i}_{(Q)}, \qquad (5)$$

where $\tilde{y} \preceq y$ means that $y_i = 0 \Rightarrow \tilde{y}_i = 0$. Thus, to convexify the surrogate $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(\cdot)$, we need to design a convex
upper bound on (Q). Notice that the term (Q) contains the sum of the scores of the $n_+ - k$ lowest ranked positive
data points. This can be readily upper bounded in several ways which give us different surrogate functions.
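The ramp surrogate and the rewriting of (P) can be checked numerically on a toy dataset. The following is a sketch under stated assumptions: it reads (P) as the sum of the k highest positive scores and (Q) as the sum of the $n_+ - k$ lowest positive scores, as described in the text, and the function names and example data are introduced here for illustration only.

```python
from itertools import combinations


def ramp_surrogate(scores, y, k):
    """Brute-force ramp surrogate; assumes k <= number of positives."""
    n = len(y)
    # First term: max over k-subsets of Delta(y, y_hat) + sum_i y_hat_i * s_i.
    first = max(
        sum((1 - y[i]) + scores[i] for i in chosen)
        for chosen in combinations(range(n), k)
    )
    # (P): sum of the scores of the k highest scoring positives.
    pos_desc = sorted((scores[i] for i in range(n) if y[i] == 1), reverse=True)
    return first - sum(pos_desc[:k])


def p_term_via_q(scores, y, k):
    """(P) written as (sum of all positive scores) - (Q)."""
    pos_asc = sorted(scores[i] for i in range(len(y)) if y[i] == 1)
    q_term = sum(pos_asc[: len(pos_asc) - k])   # n_+ - k lowest positive scores
    return sum(pos_asc) - q_term


y = [1, 1, 1, 0, 0]
s_margin = [3.0, 2.0, 0.0, 1.0, 0.5]   # two positives beat every negative by at least 1
s_bad = [0.0, 0.0, 0.0, 1.0, 0.5]      # both negatives outrank all positives

ramp_margin = ramp_surrogate(s_margin, y, 2)
ramp_bad = ramp_surrogate(s_bad, y, 2)
p_term = p_term_via_q(s_margin, y, 2)
```

With `s_margin` the weak 2-margin condition holds and the surrogate is 0, in line with Claim 3. With `s_bad` the prec@2 loss is 2 and the surrogate is 3.5, so the upper bound holds with slack. Note that the third positive in `s_margin` scores below both negatives, which the weak margin condition permits.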


3.3 The Max Surrogate $\ell^{\max}_{\mathrm{prec@}k}(\cdot)$

An immediate convex upper bound on (Q) is obtained by replacing the sum of scores of the $n_+ - k$ lowest ranked
positives with those of the highest ranked ones as follows: $(Q) \le \max_{\tilde{y} \preceq y,\ \|\tilde{y}\|_1 = n_+ - k} \sum_{i=1}^{n} \tilde{y}_i s_i$, which gives us the
$\ell^{\max}_{\mathrm{prec@}k}(\cdot)$ surrogate defined below:

$$\ell^{\max}_{\mathrm{prec@}k}(s) = \max_{\|\hat{y}\|_1 = k} \left\{ \Delta(y, \hat{y}) + \sum_{i=1}^{n} (\hat{y}_i - y_i)\, s_i + \max_{\substack{\tilde{y} \preceq (1 - \hat{y}) \cdot y \\ \|\tilde{y}\|_1 = n_+ - k}} \sum_{i=1}^{n} \tilde{y}_i s_i \right\}. \qquad (6)$$

The above surrogate, being a point-wise maximum over convex functions, is convex, as well as an upper bound on
prec@k(s) since it upper bounds $\ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(s)$. This surrogate can also be shown to be consistent w.r.t. prec@k under the
strong $\gamma$-margin condition defined below for $\gamma = 1$.

Definition 4 (Strong $\gamma$-margin). A set of n labeled data points satisfies the strong $\gamma$-margin condition if for some
scoring function s, $\min_{i : y_i = 1} s_i - \max_{j : y_j = 0} s_j \ge \gamma$.

We notice that the strong margin condition is actually the standard notion of binary classification margin and
hence much stronger than the weak $(k, \gamma)$-margin condition. It also does not incorporate any elements of the prec@k
problem. This leads us to look for tighter convex relaxations to the term (Q), which we do below.


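3.4 The Avg Surrogate $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(\cdot)$

A tighter upper bound on (Q) can be obtained by replacing (Q) by the average score of the false negatives. Define
$C(\hat{y}) = \frac{n_+ - k}{n_+ - K(y, \hat{y})}$ and consider the relaxation $(Q) \le C(\hat{y}) \sum_{i=1}^{n} (1 - \hat{y}_i)\, y_i s_i$. Substituting this relaxation for (Q), we get a
new convex surrogate $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(s)$ defined as:

$$\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(s) = \max_{\|\hat{y}\|_1 = k} \left\{ \Delta(y, \hat{y}) + \sum_{i=1}^{n} (\hat{y}_i - y_i)\, s_i + C(\hat{y}) \sum_{i=1}^{n} (1 - \hat{y}_i)\, y_i s_i \right\}. \qquad (7)$$

We refer the reader to Appendix B.4 for a proof that $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(\cdot)$ is an upper bounding surrogate. It is notable that for
$k = n_+$ (i.e. for the PRBEP measure), the surrogate $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(\cdot)$ recovers Joachims' original surrogate $\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(\cdot)$. To
establish conditional consistency of this surrogate, consider the following notion of margin: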
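$$\text{prec@}k(s) \;\le\; \ell^{\mathrm{ramp}}_{\mathrm{prec@}k}(s) \;\le\; \ell^{\mathrm{avg}}_{\mathrm{prec@}k}(s) \;\le\; \ell^{\max}_{\mathrm{prec@}k}(s)$$
$$\text{weak } (k, \gamma)\text{-margin} \;\supseteq\; (k, \gamma)\text{-margin} \;\supseteq\; \text{strong } \gamma\text{-margin}$$

Figure 1: A hierarchy among the three surrogates for prec@k and the corresponding margin conditions for conditional
consistency.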


Definition 5 ($(k, \gamma)$-margin). A set of n labeled data points satisfies the $(k, \gamma)$-margin condition if for some scoring
function s, we have, for all sets $S_+ \subseteq X_+$ of size $n_+ - k + 1$,

$$\frac{1}{n_+ - k + 1} \sum_{i \in S_+} s_i - \max_{j : y_j = 0} s_j \ge \gamma.$$

Moreover, we say that the function s realizes this margin. We abbreviate the $(k, 1)$-margin condition as simply the
k-margin condition.

We can now establish the consistency of $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(\cdot)$ under the k-margin condition. See Appendix B.5 for a proof.

Claim 6. For any scoring function s that realizes the k-margin over a dataset, $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(s) = \text{prec@}k(s) = 0$.

We note that the $(k, \gamma)$-margin condition is strictly weaker than the strong $\gamma$-margin condition (Definition 4) since
it still allows a non-negligible fraction of the positive points to be assigned a lower score than those assigned to
negatives. On the other hand, the $(k, \gamma)$-margin condition is strictly stronger than the weak $(k, \gamma)$-margin condition
(Definition 2). The weak k-margin condition only requires one set of k positives to be separated from the negatives,
whereas the above margin condition at least requires the average of all positives to be separated from the negatives.

As Figure 1 demonstrates, the three surrogates presented above, as well as their corresponding margin conditions,
fall in a neat hierarchy. We will now use these surrogates to formulate two perceptron algorithms with mistake bounds
with respect to these margin conditions.
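The ordering among the loss and the three surrogates can be checked by brute force on small random instances. The sketch below assumes the readings of the ramp, max and avg surrogates used in this section, with $C(\hat{y}) = (n_+ - k)/(n_+ - K(y, \hat{y}))$; the function name `surrogates` and the random test data are hypothetical and only meant as a sanity check, not as a proof.

```python
import random
from itertools import combinations


def surrogates(scores, y, k):
    """Return (prec@k loss, ramp, avg, max) by enumerating all k-subsets.

    Assumes k <= number of positives.
    """
    n = len(y)
    n_pos = sum(y)
    pos_desc = sorted((scores[i] for i in range(n) if y[i]), reverse=True)
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    loss = sum(1 - y[i] for i in order[:k])
    ramp_first = avg = mx = float("-inf")
    for subset in combinations(range(n), k):
        chosen = set(subset)
        d = sum(1 - y[i] for i in chosen)                      # Delta(y, y_hat)
        lin = sum(scores[i] for i in chosen) - sum(pos_desc)   # sum (y_hat_i - y_i) s_i
        # Scores of false negatives: positives not selected by y_hat.
        fn = sorted((scores[i] for i in range(n) if y[i] and i not in chosen),
                    reverse=True)
        c = (n_pos - k) / len(fn) if fn else 0.0               # C(y_hat)
        ramp_first = max(ramp_first, d + sum(scores[i] for i in chosen))
        avg = max(avg, d + lin + c * sum(fn))
        mx = max(mx, d + lin + sum(fn[: n_pos - k]))
    ramp = ramp_first - sum(pos_desc[:k])
    return loss, ramp, avg, mx


random.seed(0)
hierarchy_holds = True
for _ in range(200):
    n, k = 7, 2
    y = [1, 1, 1] + [random.randint(0, 1) for _ in range(n - 3)]
    s = [random.uniform(-2.0, 2.0) for _ in range(n)]
    loss, ramp, avg, mx = surrogates(s, y, k)
    eps = 1e-9  # tolerance for floating-point rounding
    hierarchy_holds = (hierarchy_holds and loss <= ramp + eps
                       and ramp <= avg + eps and avg <= mx + eps)
```

Enumeration is exponential in k and is used here purely for checking; Section 4 describes an efficient procedure for the avg surrogate.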


4 Perceptron & SGD Algorithms for prec@k 

We now present perceptron-style algorithms for maximizing the prec@k performance measure in bipartite ranking
settings. Our algorithms work with a stream of binary labeled points and process them in mini-batches of a predetermined
size b. Mini-batch methods have recently gained popularity and have been used to optimize ranking loss
functions such as $\ell^{\mathrm{struct}}_{\mathrm{prec@}k}(\cdot)$ [Kar et al., 2014]. It is useful to note that the requirement for mini-batches goes
away in ranking and multi-label classification settings, for our algorithms can be applied to individual data points in
those settings (e.g. individual queries in ranking settings).

At every time instant t, our algorithms receive a batch of b points $X_t = [x_t^1, \ldots, x_t^b]$ and rank these points using
the existing model. Let $\Delta_t$ denote the prec@k loss (equation (2)) at time t. If $\Delta_t = 0$, i.e. all top k ranks are occupied by
positive points, then the model is not updated. Otherwise, the model is updated using the false positives and negatives.
For sake of simplicity, we will only look at linear models in this paper. Depending on the kind of updates we make,
we get two variants of the perceptron rule for prec@k.

Our first algorithm, PERCEPTRON@K-AVG, updates the model using a combination of all the false positives and
negatives (see Algorithm 1). The effect of the update is a very natural one - it explicitly boosts the scores of the positive
points that failed to reach the top ranks, and attenuates the scores of the negative points that got very high scores. It
is interesting to note that in the limiting case of k = 1 and unit batch length (i.e. b = 1), the PERCEPTRON@K-AVG
update reduces to that of the standard perceptron algorithm [Rosenblatt, 1958, Minsky and Papert, 1988] for the choice
$\hat{y}_t = \mathrm{sign}(s_t)$. Thus, our algorithm can be seen as a natural extension of the classical perceptron algorithm.
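Algorithm 1 Perceptron@k-avg

Input: Batch length b
 1: $w^0 \leftarrow 0$, $t \leftarrow 0$
 2: while stream not exhausted do
 3:   $t \leftarrow t + 1$
 4:   Receive b data points $X_t = [x_t^1, \ldots, x_t^b]$, $y_t \in \{0,1\}^b$
 5:   Calculate $s_t = (w^{t-1})^\top X_t$ and let $\hat{y}_t = y^{(s_t, k)}$
 6:   $\Delta_t \leftarrow \Delta(y_t, \hat{y}_t)$
 7:   if $\Delta_t = 0$ then
 8:     $w^t \leftarrow w^{t-1}$
 9:   else
10:     $D_t \leftarrow \Delta_t / (\|y_t\|_1 - K(y_t, \hat{y}_t))$
11:     $w^t \leftarrow w^{t-1} - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i \cdot x_t^i$   {false positives}
12:     $w^t \leftarrow w^t + D_t \cdot \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i \cdot x_t^i$   {false negatives}
13:   end if
14: end while
15: return $w^t$

Algorithm 2 Perceptron@k-max

10:     $S_t \leftarrow \mathrm{FN}(s_t, \Delta_t)$
11:     $w^t \leftarrow w^{t-1} - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i \cdot x_t^i$   {false positives}
12:     $w^t \leftarrow w^t + \sum_{i \in S_t} x_t^i$   {top ranked false negatives}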

The next theorem establishes that, similar to the classical perceptron [Novikoff, 1962], Perceptron@K-AVG also
enjoys a mistake bound. Our mistake bound is stated in the most general agnostic setting with the hinge loss function
replaced with our surrogate $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(\cdot)$. All proofs in this section are deferred to the Appendix.

Theorem 7. Suppose $\|x_t^i\| \le R$ for all t, i. Let $\Delta^C_T = \sum_{t=1}^{T} \Delta_t$ be the cumulative mistake value observed when
Algorithm 1 is executed for T batches. Also, for any w, let $\hat{L}^{\mathrm{avg}}_T(w) = \sum_{t=1}^{T} \ell^{\mathrm{avg}}_{\mathrm{prec@}k}(w; X_t, y_t)$. Then we have

$$\Delta^C_T \le \min_{w} \left( \|w\| \cdot R \cdot \sqrt{4k} + \sqrt{\hat{L}^{\mathrm{avg}}_T(w)} \right)^2.$$

Similar to the classical perceptron mistake bound [Novikoff, 1962], the above bound can also be reduced to a
simpler convergence bound in separable settings.

Corollary 8. Suppose a unit norm $w^*$ exists such that the scoring function $s : x \mapsto x^\top w^*$ realizes the $(k, \gamma)$-margin
condition for all the batches, then Algorithm 1 guarantees the mistake bound: $\Delta^C_T \le \frac{4kR^2}{\gamma^2}$.
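The step from the agnostic bound to the separable bound can be spelled out in a few lines. This is a sketch that assumes the bound of Theorem 7 has the form $\Delta^C_T \le \min_w (\|w\| R \sqrt{4k} + \sqrt{\hat{L}^{\mathrm{avg}}_T(w)})^2$ and that Claim 6 applies batch-wise; the rescaled vector $w_\gamma$ is notation introduced here.

```latex
% Rescale the unit-norm separator so that its (k, \gamma)-margin becomes a (k, 1)-margin.
w_\gamma := \frac{w^*}{\gamma}, \qquad \|w_\gamma\| = \frac{1}{\gamma}.

% On every batch, x \mapsto x^\top w_\gamma realizes the k-margin, so by Claim 6
\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(w_\gamma; X_t, y_t) = 0 \ \ \text{for all } t
\quad\Longrightarrow\quad \hat{L}^{\mathrm{avg}}_T(w_\gamma) = 0.

% Plugging w = w_\gamma into the bound of Theorem 7:
\Delta^C_T \;\le\; \left( \|w_\gamma\| \cdot R \cdot \sqrt{4k} + 0 \right)^2
        \;=\; \frac{4 k R^2}{\gamma^2}.
```

The bound does not depend on the number of batches T, so the cumulative mistake value stays finite no matter how long the stream is.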

The above result assures that, as datasets become "easier" in the sense that their $(k, \gamma)$-margin becomes larger,
Perceptron@K-AVG will converge to an optimal hyperplane at a faster rate. It is important to note here that the
$(k, \gamma)$-margin condition is strictly weaker than the standard classification margin condition. Hence for several datasets,
Perceptron@K-AVG might be able to find a perfect ranking while at the same time, it might be impossible for
standard binary classification techniques to find any reasonable classifier in poly-time [Guruswami and Raghavendra,
2009].

We note that Perceptron@K-AVG performs updates with all the false negatives in the mini-batches. This raises
the question as to whether sparser updates are possible, as such updates would be slightly faster and, in high
dimensional settings, would ensure that the model is sparser. To this end we design the Perceptron@K-MAX algorithm
(Algorithm 2). Perceptron@K-MAX differs from Perceptron@K-AVG in that it performs updates using only a
few of the top ranked false negatives. More specifically, for any scoring function s and m > 0, define:


$$\mathrm{FN}(s, m) = \underset{S \subseteq X_+,\ |S| = m}{\arg\max} \ \sum_{i \in S} \left(1 - y^{(s,k)}_i\right) s_i$$


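Algorithm 3 SGD@k-avg

Input: Batch length b, step lengths $\eta_t$, feasible set $\mathcal{W}$
Output: A model $\bar{w} \in \mathcal{W}$
 1: $w^0 \leftarrow 0$, $t \leftarrow 0$
 2: while stream not exhausted do
 3:   $t \leftarrow t + 1$
 4:   Receive b data points $X_t = [x_t^1, \ldots, x_t^b]$, $y_t \in \{0,1\}^b$
 5:   Set $g_t \in \partial_w \ell^{\mathrm{avg}}_{\mathrm{prec@}k}(w^{t-1}; X_t, y_t)$   {See Algorithm 4}
 6:   $w^t \leftarrow \Pi_{\mathcal{W}}\left[w^{t-1} - \eta_t \cdot g_t\right]$   {Project onto set $\mathcal{W}$}
 7: end while
 8: return $\bar{w} = \frac{1}{t} \sum_{\tau=1}^{t} w^\tau$

Algorithm 4 Subgradient calculation for $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(\cdot)$

Input: A model $w_{\mathrm{in}}$, n data points X, y, parameter k
Output: A subgradient $g \in \partial_w \ell^{\mathrm{avg}}_{\mathrm{prec@}k}(w_{\mathrm{in}}; X, y)$
Sort pos. and neg. points separately in dec. order of scores assigned by $w_{\mathrm{in}}$, i.e. $s_1^+ \ge \ldots \ge s_{n_+}^+$ and $s_1^- \ge \ldots \ge s_{n_-}^-$
for $k' = 0 \to k$ do
  $D_{k'} \leftarrow \frac{k - k'}{n_+ - k'}$
  $\Delta_{k'} \leftarrow k - k' - D_{k'} \sum_{i=k'+1}^{n_+} s_i^+ + \sum_{i=1}^{k-k'} s_i^-$
  $g_{k'} \leftarrow \sum_{i=1}^{k-k'} x_i^- - D_{k'} \sum_{i=k'+1}^{n_+} x_i^+$
end for
$k^* \leftarrow \arg\max_{k'} \Delta_{k'}$
return $g_{k^*}$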


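Here, $\mathrm{FN}(s, m)$ is the set of the m top ranked false negatives. Perceptron@K-MAX makes updates only for false
negatives in the set $\mathrm{FN}(s_t, \Delta_t)$. Note that $\Delta_t$ can be significantly smaller than the total number of false negatives if $k \ll n_+$.
Perceptron@K-MAX also enjoys a mistake bound but with respect to the $\ell^{\max}_{\mathrm{prec@}k}(\cdot)$ surrogate.

Theorem 9. Suppose $\|x_t^i\| \le R$ for all t, i. Let $\Delta^C_T = \sum_{t=1}^{T} \Delta_t$ be the cumulative observed mistake value when
Algorithm 2 is executed for T batches. Also, for any w, let $\hat{L}^{\max}_T(w) = \sum_{t=1}^{T} \ell^{\max}_{\mathrm{prec@}k}(w; X_t, y_t)$. Then we have

$$\Delta^C_T \le \min_{w} \left( \|w\| \cdot R \cdot \sqrt{4k} + \sqrt{\hat{L}^{\max}_T(w)} \right)^2.$$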


Similar to Perceptron@K-AVG, we can give a simplified mistake bound in situations where the separability
condition specified by Definition 4 is satisfied.

Corollary 10. Suppose a unit norm $w^*$ exists such that the scoring function $s : x \mapsto x^\top w^*$ realizes the strong
$\gamma$-margin condition for all the batches, then Algorithm 2 guarantees the mistake bound: $\Delta^C_T \le \frac{4kR^2}{\gamma^2}$.

As the strong $\gamma$-margin condition is exactly the same as the standard notion of margin for binary classification,
the above bound is no stronger than the one for the classical perceptron. However, in practice, we observe that
PERCEPTRON@K-MAX at times outperforms even PERCEPTRON@K-AVG, even though the latter has a tighter mistake
bound. This suggests that our analysis of PERCEPTRON@K-MAX might not be optimal and fails to exploit latent
structures that might be present in the data.

Stochastic Gradient Descent for Optimizing prec@k. 

We now extend our algorithmic repertoire to include a stochastic gradient descent (SGD) algorithm for the prec@k
performance measure. SGD methods are known to be very successful at optimizing large-scale empirical risk minimization
(ERM) problems as they require only a few passes over the data to achieve optimal statistical accuracy.

However, SGD methods typically require access to cheap gradient estimates which are difficult to obtain for non-additive
performance measures such as prec@k. This has been noticed by several previous works [Kar et al., 2014,
Narasimhan et al., 2015], which propose the use of mini-batch methods to overcome this problem.
By combining the $\ell^{\mathrm{avg}}_{\mathrm{prec@}k}(\cdot)$ surrogate with mini-batch-style processing, we design SGD@K-AVG (Algorithm 3), a
scalable SGD algorithm for optimizing prec@k. The algorithm uses mini-batches to update the current model using
gradient descent steps. The subgradient calculation for this surrogate turns out to be non-trivial and is detailed in
Algorithm 4.
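The subgradient computation and one projected SGD step can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the feasible set is taken to be an L2 ball of a given `radius`, the loop over k' is clipped so that it never asks for more negatives or positives than the batch contains, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def avg_surrogate_subgradient(w, X, y, k):
    """Return (surrogate value, subgradient) following the steps of Algorithm 4."""
    dim = len(w)
    # Sort positives and negatives separately in decreasing order of score.
    pos = sorted((x for x, lab in zip(X, y) if lab == 1), key=lambda x: -dot(w, x))
    neg = sorted((x for x, lab in zip(X, y) if lab == 0), key=lambda x: -dot(w, x))
    n_pos, n_neg = len(pos), len(neg)
    best_val, best_grad = float("-inf"), [0.0] * dim
    # kp plays the role of k': true positives among the k items marked by y_hat.
    for kp in range(max(0, k - n_neg), min(k, n_pos) + 1):
        d = (k - kp) / (n_pos - kp) if n_pos > kp else 0.0   # D_{k'}
        fn, fp = pos[kp:], neg[: k - kp]   # false negatives, false positives
        val = (k - kp) - d * sum(dot(w, x) for x in fn) + sum(dot(w, x) for x in fp)
        if val > best_val:
            grad = [sum(x[j] for x in fp) - d * sum(x[j] for x in fn)
                    for j in range(dim)]
            best_val, best_grad = val, grad
    return best_val, best_grad


def sgd_step(w, X, y, k, eta, radius):
    """One projected subgradient step; W is assumed to be an L2 ball."""
    _, g = avg_surrogate_subgradient(w, X, y, k)
    w_new = [wj - eta * gj for wj, gj in zip(w, g)]
    norm = dot(w_new, w_new) ** 0.5
    if norm > radius:  # Euclidean projection onto the ball
        w_new = [radius * wj / norm for wj in w_new]
    return w_new


w0 = [0.0, 1.0]
X = [[1.0, 0.2], [0.5, 0.1], [0.0, 1.0]]
y = [1, 1, 0]
value, grad = avg_surrogate_subgradient(w0, X, y, k=1)
w1 = sgd_step(w0, X, y, k=1, eta=1.0, radius=10.0)
```

Because the loop only considers k + 1 candidate values of k' after sorting, the cost per batch is dominated by the sort, in contrast to enumerating all candidate labelings.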

The task of analyzing this algorithm is made non-trivial by the fact that the gradient estimates available to
SGD@K-AVG via Algorithm 4 are far from being unbiased. The luxury of having unbiased gradient estimates is
crucially exploited by standard SGD analyses but is, unfortunately, unavailable to us. To overcome this hurdle, we propose
a uniform convergence based proof that, in some sense, bounds the bias in the gradient estimates.

In the following section, we present this, and many other generalization and online-to-batch conversion bounds 
with applications to our perceptron and SGD algorithms. 


5 Generalization Bounds 


In this section, we discuss novel uniform convergence (UC) bounds for our proposed surrogates. We will use these UC
bounds along with the mistake bounds in Theorems 7 and 9 to prove two key results: 1) online-to-batch conversion
bounds for the PERCEPTRON@K-AVG and PERCEPTRON@K-MAX algorithms and, 2) a convergence guarantee for
the SGD@K-AVG algorithm.

To better present our generalization and convergence bounds, we use normalized versions of prec@k and the
surrogates. To do so we write $k = \kappa \cdot n_+$ for some $\kappa \in (0, 1]$ and define, for any scoring function s, its prec@$\kappa$ loss
as:

$$\text{prec@}\kappa(s; z_1, \ldots, z_n) = \frac{1}{\kappa n_+} \Delta\left(y, y^{(s, \kappa n_+)}\right).$$

We will also normalize the surrogate functions by dividing by $k = \kappa \cdot n_+$.

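Definition 11 (Uniform Convergence). A performance measure $\Psi : \mathcal{W} \times (\mathcal{X} \times \{0,1\})^n \to \mathbb{R}_+$ exhibits uniform
convergence with respect to a set of predictors $\mathcal{W}$ if for some $\alpha(b, \delta) = \mathrm{poly}\left(\frac{1}{b}, \log\frac{1}{\delta}\right)$, for a sample $\hat{z}_1, \ldots, \hat{z}_b$ of size
b chosen i.i.d. (or uniformly without replacement) from an arbitrary population $z_1, \ldots, z_n$, we have w.p. $1 - \delta$,

$$\sup_{w \in \mathcal{W}} \left| \Psi(w; z_1, \ldots, z_n) - \Psi(w; \hat{z}_1, \ldots, \hat{z}_b) \right| \le \alpha(b, \delta).$$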


We now state our UC bounds for prec@$\kappa$ and its surrogates. We refer the reader to the Appendix for proofs.

Theorem 12. The loss function prec@$\kappa(\cdot)$, as well as the surrogates $\ell^{\mathrm{ramp}}_{\mathrm{prec@}\kappa}(\cdot)$, $\ell^{\mathrm{avg}}_{\mathrm{prec@}\kappa}(\cdot)$ and $\ell^{\max}_{\mathrm{prec@}\kappa}(\cdot)$, all exhibit
uniform convergence at the rate $\alpha(b, \delta) = O\left(\sqrt{\frac{1}{b} \log \frac{1}{\delta}}\right)$.

Recently, Kar et al. [2014] also established a similar result for the $\ell^{\mathrm{struct}}_{\mathrm{prec@}\kappa}(\cdot)$ surrogate. However, a very different
proof technique is required to establish similar results for our surrogates, partly necessitated by the terms
in these surrogates which depend, in a complicated manner, on the positives predicted by the candidate labeling $\hat{y}$.
Nevertheless, the above results allow us to establish strong online-to-batch conversion bounds for PERCEPTRON@K-AVG
and PERCEPTRON@K-MAX, as well as convergence rates for the SGD@K-AVG method. In the following we
shall assume that our data streams are composed of points chosen i.i.d. (or u.w.r.) from some fixed population Z.
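The flavour of these bounds can be illustrated with a small simulation that compares the normalized prec@$\kappa$ loss on a population with its value on random batches. This is only a sketch: it fixes a single scoring function instead of taking a supremum over $\mathcal{W}$, the synthetic population (Gaussian scores, roughly 20% positives) is an assumption made here, and the helper names are hypothetical.

```python
import random


def normalized_prec_loss(scores, labels, kappa):
    """prec@kappa loss: fraction of null labels among the top k = kappa * n_+ items."""
    n_pos = sum(labels)
    k = max(1, round(kappa * n_pos))  # guard against k = 0 on tiny samples (assumption)
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    return sum(1 - labels[i] for i in order[:k]) / k


def mean_deviation(scores, labels, kappa, b, trials, rng):
    """Average |population loss - batch loss| over random batches of size b."""
    full = normalized_prec_loss(scores, labels, kappa)
    total = 0.0
    for _ in range(trials):
        idx = rng.sample(range(len(labels)), b)  # uniform, without replacement
        batch = normalized_prec_loss([scores[i] for i in idx],
                                     [labels[i] for i in idx], kappa)
        total += abs(full - batch)
    return total / trials


rng = random.Random(0)
labels = [1 if rng.random() < 0.2 else 0 for _ in range(5000)]
# Positives tend to receive higher scores than negatives.
scores = [rng.gauss(1.0 if lab else 0.0, 1.0) for lab in labels]
deviations = {b: mean_deviation(scores, labels, 0.25, b, 200, rng)
              for b in (50, 200, 800)}
```

Printing `deviations` shows how the gap between batch and population values behaves as the batch length b grows, which is the quantity that $\alpha(b, \delta)$ controls uniformly over the predictor class.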


Theorem 13. Suppose an algorithm, when fed a random stream of data points, in T batches of length b each, generates
an ensemble of models $w^1, \ldots, w^T$ which together suffer a cumulative mistake value of $\Delta^C_T$. Then, with probability
at least $1 - \delta$, we have

$$\frac{1}{T} \sum_{t=1}^{T} \text{prec@}\kappa(w^t; Z) \le \frac{\Delta^C_T}{bT} + O\left(\sqrt{\frac{1}{b} \log \frac{T}{\delta}}\right).$$

The proof of this theorem follows from Theorem 12 which guarantees that w.p. $1 - \delta$, $\text{prec@}\kappa(w^t; Z) \le \Delta_t / b +
O\left(\sqrt{\frac{1}{b} \log \frac{T}{\delta}}\right)$ for all t. Combining this with the mistake bound from Theorem 7 ensures the following generalization
guarantee for the ensemble generated by Algorithm 1.
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Average Prec@0.25 






[Figure 2 plots. Vertical axis: Average Prec@0.25. Methods: SVMPerf, 1PMB, Perceptron@k-avg, Perceptron@k-max, SGD@k-avg, SGD@k-max. Panels: (a) PPI, (b) Letter, (c) Adult, (d) IJCNN.]

Figure 2: A comparison of the proposed perceptron and SGD based methods with baseline methods (SVMPerf and 1PMB) on prec@0.25 maximization tasks. PERCEPTRON@K-AVG and SGD@K-AVG (both based on $\ell^{avg}_{prec@k}(\cdot)$) are the most consistent methods across tasks.



Figure 3: (a), (b): A comparison of different methods on optimizing prec@$\kappa$ for different values of $\kappa$. (c), (d), (e): The performance of the proposed perceptron and SGD methods on prec@0.25 maximization tasks with varying batch lengths $b$.


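Corollary 14. Let $\mathbf{w}^1, \ldots, \mathbf{w}^T$ be the ensemble of classifiers returned by the PERCEPTRON@K-AVG algorithm on a random stream of data points and batch length $b$. Then, with probability at least $1 - \delta$, for any $\mathbf{w}^*$ we have

$$\frac{1}{T}\sum_{t=1}^{T} \text{prec@}\kappa(\mathbf{w}^t; Z) \le \left(\sqrt{\ell^{avg}_{prec@\kappa}(\mathbf{w}^*; Z)} + C\right)^2,$$

where $C = O(\|\mathbf{w}^*\| \ldots)$.

A similar statement holds for the PERCEPTRON@K-MAX algorithm with respect to the $\ell^{max}_{prec@\kappa}(\cdot)$ surrogate as well.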

Using the results from Theorem 12 we can also establish the convergence rate of the SGD@K-AVG algorithm. 


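Theorem 15. Let $\bar{\mathbf{w}}$ be the model returned by the SGD@K-AVG algorithm when executed on a stream with $T$ batches of length $b$. Then with probability at least $1 - \delta$, for any $\mathbf{w}^* \in \mathcal{W}$, we have

$$\ell^{avg}_{prec@\kappa}(\bar{\mathbf{w}}; Z) \le \ell^{avg}_{prec@\kappa}(\mathbf{w}^*; Z) + O\left(\sqrt{\frac{1}{b}\log\frac{T}{\delta}}\right) + O\left(\sqrt{\frac{1}{T}}\right).$$

The proof of this theorem can be found in the appendix.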


6 Experiments 


We shall now evaluate our methods on several benchmark datasets for binary classification problems with a rare-class.

Datasets: We evaluated our methods on 7 publicly available benchmark datasets: a) PPI, b) KDD Cup 2008, c) 
Letter, d) Adult, e) IJCNN, f) Covertype, and g) Cod-RNA. All datasets exhibit moderate to severe label imbalance 
with the KDD Cup dataset having just 0.61% positives. 

Methods: We compared both perceptron algorithms, SGD@K-AVG, as well as an SGD solver for the $\ell^{max}_{prec@k}(\cdot)$ surrogate, with the cutting plane-based SVMPerf solver of Joachims [2005]. We also compare against the stochastic 1PMB solver of Kar et al. [2014]. The perceptron and SGD methods were given a maximum of 25 passes over the data with a batch length of 500. All methods were implemented in C. We used 70% of the data for training and the rest for testing. All results are averaged over 5 random train-test splits.

Our experiments reveal three interesting insights into the problem of prec@k maximization: 1) using tighter surrogates for optimization routines is indeed beneficial, 2) the presence of a stochastic solver cannot always compensate for the use of a suboptimal surrogate, and 3) mini-batch techniques, applied with perceptron or SGD-style methods, can offer rapid convergence to accurate models.

We first timed all the methods on prec@$\kappa$ maximization tasks for $\kappa = 0.25$ on various datasets (see Figure 2). Of all the methods, the cutting plane method (SVMPerf) was found to be the most expensive computationally. On the other hand, the perceptron and stochastic gradient methods, which make frequent but cheap updates, were much faster at identifying accurate solutions.

We also observed that PERCEPTRON@K-AVG and SGD@K-AVG, which are based on the tight $\ell^{avg}_{prec@k}(\cdot)$ surrogate, were the most consistent at converging to accurate solutions whereas PERCEPTRON@K-MAX and SGD@K-MAX, which are based on the loose $\ell^{max}_{prec@k}(\cdot)$ surrogate, showed large deviations in performance across tasks. Also, 1PMB and SVMPerf, which are based on the non upper-bounding $\ell^{struct}_{prec@k}(\cdot)$ surrogate, were frequently found to converge to suboptimal solutions.

The effect of working with a tight surrogate is also clear from Figure 3 (a), (b) where the algorithms working with our novel surrogates were found to consistently outperform the SVMPerf method which works with the $\ell^{struct}_{prec@k}(\cdot)$ surrogate. For these experiments, SVMPerf was allowed a runtime of up to 50x of what was given to our methods after which it was terminated.

Finally, to establish the stability of our algorithms, we ran both the perceptron and the SGD algorithms with varying batch lengths (see Figure 3 (c)-(e)). We found the algorithms to be relatively stable to the setting of the batch length. To put things in perspective, all methods registered a relative variation of less than 5% in accuracies across batch lengths spanning an order of magnitude or more. We present additional experimental results in the appendix.

A Structural SVM Surrogate for prec@k

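The structural SVM surrogate for prec@k for a set of $n$ points $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\} \in (\mathbb{R}^d \times \{0,1\})^n$ and model $\mathbf{w}$ can be written as

$$\ell^{struct}_{prec@k}(\mathbf{w}) = \max_{\hat{\mathbf{y}} \in \{0,1\}^n,\ \|\hat{\mathbf{y}}\|_1 = k} \left\{ \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)\, \mathbf{w}^\top \mathbf{x}_i \right\}.$$
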

We shall now give a simple setting where this surrogate produces a suboptimal model.

Consider a set of 6 points in $\mathbb{R} \times \{0,1\}$: $\{(-1,1), (-1,1), (-2,1), (-3,0), (-3,0), (-3,0)\}$, and suppose we are interested in prec@1. Note that the optimum model that maximizes prec@1 on these points has a positive sign. We will now show that the model $w^* \in \mathbb{R}$ that minimizes the above structural SVM surrogate on these points has a negative sign. On the contrary, let us assume that $w^*$ has a positive sign, and arrive at a contradiction; we shall consider the following two cases:

(i) $w^* \ge 3$. It can be verified that

$$\ell^{struct}_{prec@k}(w^*) = 1 + \left(\tfrac{1}{6}(-w^*) - 1\right) - \tfrac{1}{6}\left(-w^* - w^* - 2w^*\right) = \tfrac{1}{2} w^*.$$

On the other hand, for the model $w' = -w^*$, we have

$$\ell^{struct}_{prec@k}(w') = 1 + \left(\tfrac{1}{6}(3w^*) - 0\right) - \tfrac{1}{6}\left(w^* + w^* + 2w^*\right) = 1 - \tfrac{1}{6} w^* < \ell^{struct}_{prec@k}(w^*),$$

where the last step follows from $w^* \ge 3$; clearly, $w^*$ is not optimal for the structural SVM surrogate, and hence a contradiction.

(ii) $w^* < 3$. Here we have

$$\ell^{struct}_{prec@k}(w^*) = 1 + \left(\tfrac{1}{6}(-3w^*) - 0\right) - \tfrac{1}{6}\left(-w^* - w^* - 2w^*\right) = 1 + \tfrac{1}{6} w^*.$$

For $w' = -w^*$,

$$\ell^{struct}_{prec@k}(w') = 1 + \left(\tfrac{1}{6}(3w^*) - 0\right) - \tfrac{1}{6}\left(w^* + w^* + 2w^*\right) = 1 - \tfrac{1}{6} w^* < \ell^{struct}_{prec@k}(w^*).$$

Here again, we have a contradiction. Notice that this surrogate can take negative values (when $w < -6$ for example) whereas prec@k is a non-negative valued function. This clearly indicates that this surrogate cannot upper bound prec@k. More specifically, notice that for $w < 0$, we have prec@k$(w) = 1$; however, the above analysis demonstrates cases where $\ell^{struct}_{prec@k}(w) < 1$, which gives an explicit example that this surrogate is not even an upper bounding surrogate.
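The six-point example can be checked by brute force. The sketch below assumes the surrogate has the form $\Delta(\mathbf{y}, \hat{\mathbf{y}}) + \frac{1}{n}\sum_i (\hat{y}_i - y_i)\, w x_i$ maximized over labelings with exactly $k = 1$ positive prediction; the $1/n$ scaling is an assumption inferred from the constants in the example, and the function names are illustrative.

```python
from itertools import combinations

X = [-1.0, -1.0, -2.0, -3.0, -3.0, -3.0]
Y = [1, 1, 1, 0, 0, 0]
K = 1


def prec_loss(w):
    """Number of negatives among the top-K points ranked by w * x."""
    scores = [w * x for x in X]
    order = sorted(range(len(X)), key=lambda i: -scores[i])
    return sum(1 - Y[i] for i in order[:K])


def struct_surrogate(w):
    """ASSUMED form: max over ||y_hat||_1 = K of Delta + (1/n) sum (y_hat_i - y_i) w x_i."""
    n = len(X)
    scores = [w * x for x in X]
    base = sum(y * s for y, s in zip(Y, scores)) / n
    best = float("-inf")
    for chosen in combinations(range(n), K):
        delta = sum(1 - Y[i] for i in chosen)
        best = max(best, delta + sum(scores[i] for i in chosen) / n - base)
    return best


better_negative = struct_surrogate(-2.0) < struct_surrogate(2.0)
below_true_loss = struct_surrogate(-2.0) < prec_loss(-2.0)
```

Under this assumed form, a negative model attains a smaller surrogate value than its positive mirror image even though it ranks a negative point first, which is the failure mode the appendix describes.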





B Proofs of Claims from Section 3

B.1 Proof of Claim 1

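Claim 1. For any $k \le n_+$ and scoring function $s$, we have

$$\ell^{ramp}_{prec@k}(s) \ge \text{prec@}k(s).$$

Moreover, if for some scoring function $s$, we have $\ell^{ramp}_{prec@k}(s) \le \xi$, then there necessarily exists a set $S \subset [n]$ of size at most $k$ such that for all $\|\hat{\mathbf{y}}\|_1 = k$, we have

$$\sum_{i \in S} s_i \ge \sum_{i=1}^n \hat{y}_i s_i + \Delta(\mathbf{y}, \hat{\mathbf{y}}) - \xi.$$
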

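Proof. Let $\hat{\mathbf{y}} = \mathbf{y}^{(s,k)}$ so that we have $\Delta(\mathbf{y}, \hat{\mathbf{y}}) = \text{prec@}k(s)$. Then we have

$$\begin{aligned}
\ell^{ramp}_{prec@k}(s) &= \max_{\|\hat{\mathbf{y}}'\|_1 = k}\Big\{\Delta(\mathbf{y}, \hat{\mathbf{y}}') + \sum_{i=1}^n \hat{y}'_i s_i\Big\} - \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i \\
&\ge \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n \hat{y}_i s_i - \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i \\
&= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \max_{\|\hat{\mathbf{y}}'\|_1 = k} \sum_{i=1}^n \hat{y}'_i s_i - \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i \\
&\ge \Delta(\mathbf{y}, \hat{\mathbf{y}}),
\end{aligned}$$

where the third step follows from the definition of $\hat{\mathbf{y}}$. This proves the first claim. For the second claim, suppose for some scoring function $s$, we have $\ell^{ramp}_{prec@k}(s) \le \xi$. Then if we consider $S^*$ to be the set of $k$ highest ranked positive points, then for any $\hat{\mathbf{y}}$ with $\|\hat{\mathbf{y}}\|_1 = k$ we have

$$\sum_{i \in S^*} s_i = \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i \ge \max_{\|\hat{\mathbf{y}}'\|_1 = k}\Big\{\Delta(\mathbf{y}, \hat{\mathbf{y}}') + \sum_{i=1}^n \hat{y}'_i s_i\Big\} - \xi \ge \sum_{i=1}^n \hat{y}_i s_i + \Delta(\mathbf{y}, \hat{\mathbf{y}}) - \xi,$$

which proves the claim.

□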
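The upper bounding property of Claim 1 can be sanity-checked numerically on small instances by enumerating all labelings. The following is a brute-force sketch, assuming the unnormalized ramp surrogate written as the difference of the two maxima used in the proof; function names and the random test setup are illustrative only.

```python
import random
from itertools import combinations


def delta(labels, chosen):
    """Delta(y, y_hat): number of negatives among the chosen (predicted positive) points."""
    return sum(1 - labels[i] for i in chosen)


def prec_at_k(scores, labels, k):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return delta(labels, order[:k])


def ramp_surrogate(scores, labels, k):
    n = len(scores)
    # First max: over all labelings with exactly k predicted positives.
    first = max(delta(labels, c) + sum(scores[i] for i in c)
                for c in combinations(range(n), k))
    # Second max: labelings whose k predicted positives are all true positives.
    pos_scores = sorted((scores[i] for i in range(n) if labels[i] == 1), reverse=True)
    return first - sum(pos_scores[:k])


rng = random.Random(0)
labels = [1, 1, 1, 0, 0, 0, 0]
violations = 0
for _ in range(200):
    scores = [rng.uniform(-2, 2) for _ in labels]
    for k in (1, 2, 3):  # k <= n_+ = 3
        if ramp_surrogate(scores, labels, k) < prec_at_k(scores, labels, k) - 1e-9:
            violations += 1
```

Enumeration is exponential in $n$, so this is only meant for checking tiny instances, not for optimization.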


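B.2 Proof of Claim 3

Claim 3. For any scoring function $s$ that realizes the weak $k$-margin over a dataset we have

$$\ell^{ramp}_{prec@k}(s) = \text{prec@}k(s) = 0.$$

Proof. Consider a scoring function $s$ that satisfies the weak $k$-margin condition and any $\hat{\mathbf{y}}$ such that $\|\hat{\mathbf{y}}\|_1 = k$. Based on the prec@k accuracy of $\hat{\mathbf{y}}$, we have the following two cases.

Case 1 ($K(\mathbf{y}, \hat{\mathbf{y}}) = k$): In this case we have

$$\Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n \hat{y}_i s_i - \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i = 0 + \sum_{i=1}^n \hat{y}_i s_i - \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i \le 0,$$

where the first step follows since $K(\mathbf{y}, \hat{\mathbf{y}}) = k$ and the second step follows since $\|\hat{\mathbf{y}}\|_1 = k$, as well as $K(\mathbf{y}, \hat{\mathbf{y}}) = k$.

Case 2 ($K(\mathbf{y}, \hat{\mathbf{y}}) = k' < k$): In this case let $S^*$ be the set of $k$ top ranked positive points according to the scoring function $s$. Also let $S_1^*$ be the set of $k'$ ($= K(\mathbf{y}, \hat{\mathbf{y}})$) top ranked positives and let $S_2^* = S^* \setminus S_1^*$. Writing $M^* = \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i$, we have

$$\begin{aligned}
\Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n \hat{y}_i s_i - M^* &= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \underbrace{\sum_{i=1}^n \hat{y}_i y_i s_i}_{(A)} + \underbrace{\sum_{i=1}^n \hat{y}_i (1 - y_i) s_i}_{(B)} - M^* \\
&\le \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i \in S_1^*} s_i + \sum_{i=1}^n \hat{y}_i (1 - y_i) s_i - M^* \\
&\le \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i \in S_1^*} s_i + \sum_{i \in S_2^*} s_i - (k - k') - M^* \\
&= k - k' + \sum_{i \in S^*} s_i - (k - k') - M^* \\
&= 0,
\end{aligned}$$

where the second step follows since the term $(A)$ consists of $k'$ true positives, the third step follows since the term $(B)$ contains $k - k'$ false positives, i.e. negatives, and the weak $k$-margin condition, the fourth step follows since $\Delta(\mathbf{y}, \hat{\mathbf{y}}) = k - K(\mathbf{y}, \hat{\mathbf{y}})$, and the fifth step follows since by the definition of the set $S^*$, we have

$$\sum_{i \in S^*} s_i = \max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y},\tilde{\mathbf{y}}) = k} \sum_{i=1}^n \tilde{y}_i s_i.$$

In both cases, we have shown the surrogate to be non-positive. Since the performance measure prec@k cannot take negative values, this, along with the upper bounding property, implies that prec@k$(s) = 0$ as well. This finishes the proof. □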


B.3 A Useful Supplementary Lemma 


Lemma 16. Given a set of $n$ real numbers $x_1, \ldots, x_n$ and any two integers $k \le k' \le n$, we have

$$\min_{|S| = k} \frac{1}{k}\sum_{i \in S} x_i \le \min_{|S'| = k'} \frac{1}{k'}\sum_{j \in S'} x_j.$$

Proof. The above is obviously true if $k = k'$ so we assume that $k' > k$. Without loss of generality assume that the set is ordered in ascending order i.e. $x_1 \le x_2 \le \ldots \le x_n$. Thus, the above statement is equivalent to showing that

$$\begin{aligned}
\frac{1}{k}\sum_{i=1}^{k} x_i \le \frac{1}{k'}\sum_{i=1}^{k'} x_i
&\iff \left(\frac{1}{k} - \frac{1}{k'}\right)\sum_{i=1}^{k} x_i \le \frac{1}{k'}\sum_{j=k+1}^{k'} x_j \\
&\iff \frac{1}{k}\sum_{i=1}^{k} x_i \le \frac{1}{k' - k}\sum_{j=k+1}^{k'} x_j,
\end{aligned}$$

where the last inequality is true since $k' - k > 0$ and the left hand side is the average of numbers which are all no larger than the numbers whose average forms the right hand side. This proves the lemma. □
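Lemma 16 says that the smallest achievable subset average is non-decreasing in the subset size. A quick numerical check, using the fact from the proof that the minimizing subset consists of the smallest elements (the helper name `min_avg` is illustrative):

```python
import random


def min_avg(xs, k):
    """Smallest average over subsets of size k: the mean of the k smallest values."""
    return sum(sorted(xs)[:k]) / k


rng = random.Random(1)
xs = [rng.uniform(-5, 5) for _ in range(12)]

# gap >= 0 for every pair k1 <= k2 is exactly the statement of the lemma.
gaps = [min_avg(xs, k2) - min_avg(xs, k1)
        for k1 in range(1, 13) for k2 in range(k1, 13)]
```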


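B.4 Proof of the Upper-bounding Property for the $\ell^{avg}_{prec@k}(\cdot)$ Surrogate

Claim 17. For any $k \le n_+$ and scoring function $s$, we have

$$\ell^{avg}_{prec@k}(s) \ge \text{prec@}k(s).$$

Moreover, for linear scoring functions i.e. $s(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i$ for $\mathbf{w} \in \mathcal{W}$, the surrogate $\ell^{avg}_{prec@k}(\mathbf{w})$ is convex in $\mathbf{w}$.

Proof. We use the fact observed before that for any scoring function, we have $\Delta(\mathbf{y}, \mathbf{y}^{(s,k)}) = \text{prec@}k(s)$. We start off by showing the second part of the claim. Recall the definition of the surrogate $\ell^{avg}_{prec@k}(\mathbf{w})$:

$$\ell^{avg}_{prec@k}(\mathbf{w}) = \max_{\|\hat{\mathbf{y}}\|_1 = k}\left\{\Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n (\hat{y}_i - y_i)\,\mathbf{w}^\top\mathbf{x}_i + \frac{1}{C(\hat{\mathbf{y}})}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, \mathbf{w}^\top\mathbf{x}_i\right\}, \qquad C(\hat{\mathbf{y}}) = \frac{n_+ - K(\mathbf{y}, \hat{\mathbf{y}})}{n_+ - k}.$$

For sake of simplicity, for any $\hat{\mathbf{y}} \in \{0,1\}^n$, define

$$\Delta(s, \hat{\mathbf{y}}) = \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n s_i(\hat{y}_i - y_i) + \frac{1}{C(\hat{\mathbf{y}})}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, s_i.$$

The convexity of $\ell^{avg}_{prec@k}(\mathbf{w})$ follows from the observation that the inner term in the maximization is linear (hence convex) in $\mathbf{w}$ and the max function is convex and increasing. We now move on to prove the first part. For sake of convenience, let $\hat{\mathbf{y}} = \mathbf{y}^{(s,k)}$. Note that $\|\hat{\mathbf{y}}\|_1 = k$ by definition. Writing $K = K(\mathbf{y}, \hat{\mathbf{y}})$, this gives us

$$\begin{aligned}
\ell^{avg}_{prec@k}(s) &= \max_{\|\hat{\mathbf{y}}'\|_1 = k} \Delta(s, \hat{\mathbf{y}}') \ge \Delta(s, \hat{\mathbf{y}}) \\
&= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n s_i(\hat{y}_i - y_i) + \frac{1}{C(\hat{\mathbf{y}})}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, s_i \\
&= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n s_i\big(\hat{y}_i(1 - y_i) - y_i(1 - \hat{y}_i)\big) + \frac{n_+ - k}{n_+ - K}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, s_i \\
&= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \underbrace{\sum_{i=1}^n \hat{y}_i(1 - y_i)\, s_i}_{(A)} - \underbrace{\frac{k - K}{n_+ - K}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, s_i}_{(B)}.
\end{aligned}$$

Now define $m = \min_{i:\ \hat{y}_i = 1,\ y_i = 0} s_i$ and $M = \max_{i:\ \hat{y}_i = 0,\ y_i = 1} s_i$. This gives us

$$(A) = \sum_{i=1}^n \hat{y}_i(1 - y_i)\, s_i \ge m \sum_{i=1}^n \hat{y}_i(1 - y_i) = \Delta(\mathbf{y}, \hat{\mathbf{y}}) \cdot m,$$

and

$$(B) = \frac{k - K}{n_+ - K}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, s_i \le \frac{k - K}{n_+ - K}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, M = (k - K) \cdot M = \Delta(\mathbf{y}, \hat{\mathbf{y}}) \cdot M.$$

However, by definition of $\hat{\mathbf{y}} = \mathbf{y}^{(s,k)}$ we have

$$m \ge \min_{i:\ \hat{y}_i = 1} s_i \ge \max_{j:\ \hat{y}_j = 0} s_j \ge M.$$

Thus we have

$$\ell^{avg}_{prec@k}(s) \ge \Delta(\mathbf{y}, \hat{\mathbf{y}}) + (A) - (B) \ge \Delta(\mathbf{y}, \hat{\mathbf{y}})(1 + m - M) \ge \Delta(\mathbf{y}, \hat{\mathbf{y}}) = \text{prec@}k(s). \qquad □$$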
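The inequality of Claim 17 can also be checked by brute force on small instances. The sketch below assumes the form of the surrogate used in this proof, with $1/C(\hat{\mathbf{y}}) = (n_+ - k)/(n_+ - K(\mathbf{y}, \hat{\mathbf{y}}))$ and the correction term taken to be zero when there are no false negatives; the function names are illustrative.

```python
import random
from itertools import combinations


def prec_at_k(scores, labels, k):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(1 - labels[i] for i in order[:k])


def avg_surrogate(scores, labels, k):
    n, n_pos = len(scores), sum(labels)
    pos_total = sum(s for s, y in zip(scores, labels) if y == 1)
    best = float("-inf")
    for c in combinations(range(n), k):
        chosen = set(c)
        true_pos = sum(labels[i] for i in chosen)
        mistakes = k - true_pos                                  # Delta(y, y_hat)
        margin = sum(scores[i] for i in chosen) - pos_total      # sum_i s_i (y_hat_i - y_i)
        missed = sum(scores[i] for i in range(n)
                     if labels[i] == 1 and i not in chosen)      # scores of false negatives
        if n_pos > true_pos:
            correction = (n_pos - k) / (n_pos - true_pos) * missed
        else:
            correction = 0.0                                     # no false negatives at all
        best = max(best, mistakes + margin + correction)
    return best


rng = random.Random(7)
labels = [1, 1, 1, 0, 0, 0, 0]
violations = 0
for _ in range(200):
    scores = [rng.uniform(-2, 2) for _ in labels]
    for k in (1, 2, 3):
        if avg_surrogate(scores, labels, k) < prec_at_k(scores, labels, k) - 1e-9:
            violations += 1
```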

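B.5 Proof of Claim 6

Claim 6. For any scoring function $s$ that realizes the $k$-margin over a dataset we have

$$\ell^{avg}_{prec@k}(s) = \text{prec@}k(s) = 0.$$

Proof. We shall prove that for any $\hat{\mathbf{y}}$ such that $\|\hat{\mathbf{y}}\|_1 = k$, under the $k$-margin condition, we have $\Delta(s, \hat{\mathbf{y}}) \le 0$. This will show us that $\ell^{avg}_{prec@k}(s) = \max_{\|\hat{\mathbf{y}}\|_1 = k} \Delta(s, \hat{\mathbf{y}}) \le 0$. Using Claim 17 and the fact that prec@k$(s) \ge 0$ will then prove the claimed result. We will analyze two cases in order to do this.

Case 1 ($K(\mathbf{y}, \hat{\mathbf{y}}) = k$): In this case the labeling $\hat{\mathbf{y}}$ is able to identify $k$ relevant points correctly and thus we have $C(\hat{\mathbf{y}}) = 1$ and we have

$$\Delta(s, \hat{\mathbf{y}}) = \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n s_i(\hat{y}_i - y_i) + \sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, s_i.$$

Now, since $K(\mathbf{y}, \hat{\mathbf{y}}) = k$, we have $\Delta(\mathbf{y}, \hat{\mathbf{y}}) = 0$ which means for all $i$ such that $\hat{y}_i = 1$, we also have $y_i = 1$. Thus, we have $\hat{y}_i = \hat{y}_i y_i$. Thus,

$$\Delta(s, \hat{\mathbf{y}}) = 0 + \sum_{i=1}^n s_i(\hat{y}_i - y_i) + \sum_{i=1}^n (y_i - \hat{y}_i y_i)\, s_i = \sum_{i=1}^n s_i(\hat{y}_i - y_i) + \sum_{i=1}^n (y_i - \hat{y}_i)\, s_i = 0.$$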

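Case 2 ($K(\mathbf{y}, \hat{\mathbf{y}}) = k' < k$): In this case, $\hat{\mathbf{y}}$ contains false positives. Thus we have

$$\begin{aligned}
\Delta(s, \hat{\mathbf{y}}) &= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n s_i(\hat{y}_i - y_i) + \frac{n_+ - k}{n_+ - k'}\sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, s_i \\
&= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n \hat{y}_i(1 - y_i)\, s_i - \frac{k - k'}{n_+ - k'}\sum_{i=1}^n y_i(1 - \hat{y}_i)\, s_i \\
&= (k - k')\Bigg(\underbrace{\frac{1}{k - k'}\Delta(\mathbf{y}, \hat{\mathbf{y}})}_{(A)} + \underbrace{\frac{1}{k - k'}\sum_{i=1}^n \hat{y}_i(1 - y_i)\, s_i}_{(B)} - \underbrace{\frac{1}{n_+ - k'}\sum_{i=1}^n y_i(1 - \hat{y}_i)\, s_i}_{(C)}\Bigg).
\end{aligned}$$

Now we have, by definition, $(A) = 1$. We also have

$$(B) = \frac{1}{k - k'}\sum_{i=1}^n \hat{y}_i(1 - y_i)\, s_i \le \max_{j:\ y_j = 0} s_j,$$

as well as

$$(C) = \frac{1}{n_+ - k'}\sum_{i=1}^n y_i(1 - \hat{y}_i)\, s_i \ge \min_{S_+ \subseteq X_+,\ |S_+| = n_+ - k'} \frac{1}{n_+ - k'}\sum_{i \in S_+} s_i \ge \min_{S_+ \subseteq X_+,\ |S_+| = n_+ - k + 1} \frac{1}{n_+ - k + 1}\sum_{i \in S_+} s_i,$$

where the last step follows from Lemma 16 and the fact that $k' \le k - 1$ in this case analysis. Then we have

$$\Delta(s, \hat{\mathbf{y}}) = (k - k')\big((A) + (B) - (C)\big) \le (k - k')\left(1 + \max_{j:\ y_j = 0} s_j - \min_{S_+ \subseteq X_+,\ |S_+| = n_+ - k + 1} \frac{1}{n_+ - k + 1}\sum_{i \in S_+} s_i\right) \le 0,$$

where the last step follows because $s$ realizes the $k$-margin. Having exhausted all cases, we establish the claim. □

C Proofs from Section 4

C.1 Proof of Theorem 7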

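Theorem 7. Suppose $\|\mathbf{x}_i^t\| \le R$ for all $t, i$. Let $\Delta_T^C = \sum_{t=1}^T \Delta_t$ be the cumulative observed mistake values when the PERCEPTRON@K-AVG algorithm is run. Also, for any predictor $\mathbf{w}$, let $\mathcal{L}_T(\mathbf{w}) = \sum_{t=1}^T \ell^{avg}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t)$. Then we have

$$\Delta_T^C \le \min_{\mathbf{w}} \left(\|\mathbf{w}\| \cdot R \cdot \sqrt{4k} + \sqrt{\mathcal{L}_T(\mathbf{w})}\right)^2.$$
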

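Proof. We will prove the theorem using two lemmata that we state below.

Lemma 18. For any time step $t$, we have

$$\|\mathbf{w}_t\|^2 \le \|\mathbf{w}_{t-1}\|^2 + 4kR^2\Delta_t.$$

Lemma 19. For any fixed $\mathbf{w} \in \mathcal{W}$, define $P_t := \langle \mathbf{w}_t, \mathbf{w} \rangle$. Then we have

$$P_t \ge P_{t-1} + \Delta_t - \ell^{avg}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t).$$

Using Lemmata 18 and 19 we can establish the mistake bound as follows. A repeated application of Lemma 19 tells us that

$$P_T \ge \sum_{t=1}^T \Delta_t - \sum_{t=1}^T \ell^{avg}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t) = \Delta_T^C - \mathcal{L}_T(\mathbf{w}).$$

In case the right hand side is negative, we already have the result with us. In case it is positive, we can now analyze further using the Cauchy-Schwarz inequality, and a repeated application of Lemma 18. Starting from the above we have

$$\begin{aligned}
\Delta_T^C &\le P_T + \mathcal{L}_T(\mathbf{w}) \\
&= \langle \mathbf{w}_T, \mathbf{w} \rangle + \mathcal{L}_T(\mathbf{w}) \\
&\le \|\mathbf{w}_T\| \|\mathbf{w}\| + \mathcal{L}_T(\mathbf{w}) \\
&\le \|\mathbf{w}\| \sqrt{4kR^2 \cdot \Delta_T^C} + \mathcal{L}_T(\mathbf{w}),
\end{aligned}$$

which gives us the desired result upon solving the quadratic inequality. We now prove the lemmata below. Note that in the following discussion, we have, for sake of brevity, used the notation $\hat{\mathbf{y}} = \hat{\mathbf{y}}_t = \mathbf{y}^{(\mathbf{w}_{t-1}, k)}$.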
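The last step of the mistake bound rests on an elementary fact about quadratic inequalities: if $(x - l)^2 \le cx$ with $x > l > 0$ and $c > 0$, then $x \le (\sqrt{l} + \sqrt{c})^2$. Here $x$ plays the role of $\Delta_T^C$, $l$ of $\mathcal{L}_T(\mathbf{w})$ and $c$ of $4kR^2\|\mathbf{w}\|^2$. A small numerical check of this fact, as a sketch with illustrative names:

```python
import math
import random


def largest_solution(l, c):
    """Largest x with (x - l)^2 <= c * x: the larger root of x^2 - (2l + c) x + l^2 = 0."""
    s = 2 * l + c
    return (s + math.sqrt(s * s - 4 * l * l)) / 2


rng = random.Random(5)
worst_slack = float("inf")
for _ in range(1000):
    l, c = rng.uniform(0.01, 50), rng.uniform(0.01, 50)
    bound = (math.sqrt(l) + math.sqrt(c)) ** 2
    # Non-negative slack means every feasible x respects the claimed bound.
    worst_slack = min(worst_slack, bound - largest_solution(l, c))
```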

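Footnote: More specifically, in solving the quadratic inequality we use the fact that the inequality $(x - l)^2 \le cx$ implies $x \le (\sqrt{l} + \sqrt{c})^2$ whenever $x, l, c > 0$ and $x > l$.

Proof of Lemma 18. For time steps where $\Delta_t = 0$, the result obviously holds since $\mathbf{w}_t = \mathbf{w}_{t-1}$. For analyzing other time steps, let $\mathbf{v}_t = D_t \cdot \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i \cdot \mathbf{x}_i - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i \cdot \mathbf{x}_i$ so that $\mathbf{w}_t = \mathbf{w}_{t-1} + \mathbf{v}_t$. This gives us

$$\|\mathbf{w}_t\|^2 = \|\mathbf{w}_{t-1}\|^2 + 2\langle \mathbf{w}_{t-1}, \mathbf{v}_t \rangle + \|\mathbf{v}_t\|^2.$$

Let $s_i = \mathbf{w}_{t-1}^\top \mathbf{x}_i$. Then we have

$$\begin{aligned}
\langle \mathbf{w}_{t-1}, \mathbf{v}_t \rangle &= D_t \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i\, s_i - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i\, s_i \\
&= \Delta_t \Bigg(\underbrace{\frac{1}{\|\mathbf{y}_t\|_1 - K(\mathbf{y}_t, \hat{\mathbf{y}}_t)} \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i\, s_i}_{(A)} - \underbrace{\frac{1}{\Delta_t} \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i\, s_i}_{(B)}\Bigg) \\
&\le 0,
\end{aligned}$$

where the last step follows since $(A)$ is the average of scores given to the false negatives and $(B)$ is the average of scores given to the false positives and by the definition of $\hat{\mathbf{y}}_t$, since false negatives are assigned scores less than false positives, we have $(A) \le (B)$. We also have

$$\begin{aligned}
\|\mathbf{v}_t\|^2 &= \Bigg\| \frac{\Delta_t}{\|\mathbf{y}_t\|_1 - K(\mathbf{y}_t, \hat{\mathbf{y}}_t)} \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i \cdot \mathbf{x}_i - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i \cdot \mathbf{x}_i \Bigg\|^2 \\
&\le 2\left(\frac{\Delta_t}{\|\mathbf{y}_t\|_1 - K(\mathbf{y}_t, \hat{\mathbf{y}}_t)}\right)^2 \Bigg\| \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i \cdot \mathbf{x}_i \Bigg\|^2 + 2\,\Bigg\| \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i \cdot \mathbf{x}_i \Bigg\|^2 \\
&\le 4kR^2\Delta_t,
\end{aligned}$$

since $\Delta_t \le k$. Combining the two gives us the desired result.

□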
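To make the update analyzed in Lemma 18 concrete, here is a minimal sketch of one PERCEPTRON@K-AVG-style step following the form of $\mathbf{v}_t$ in the proof, together with a check of the norm bound on a synthetic stream. The data generation (labels determined by the first feature), the per-batch value of $R$, and all names are assumptions made for illustration; this is not the authors' implementation.

```python
import random


def perceptron_k_avg_step(w, X, y, k):
    """One update on a batch (X, y); returns (new_w, mistakes) with mistakes = Delta_t."""
    scores = [sum(a * b for a, b in zip(w, x)) for x in X]
    top = set(sorted(range(len(X)), key=lambda i: -scores[i])[:k])
    false_pos = [i for i in top if y[i] == 0]
    false_neg = [i for i in range(len(X)) if i not in top and y[i] == 1]
    mistakes = len(false_pos)
    if mistakes == 0:
        return list(w), 0
    d_t = mistakes / len(false_neg)  # D_t = Delta_t / (||y_t||_1 - K(y_t, y_hat_t))
    new_w = list(w)
    for i in false_neg:              # pull false negatives up, weighted by D_t
        new_w = [a + d_t * b for a, b in zip(new_w, X[i])]
    for i in false_pos:              # push false positives down
        new_w = [a - b for a, b in zip(new_w, X[i])]
    return new_w, mistakes


rng = random.Random(2)
dim, batch, n_pos, k = 3, 20, 6, 3
w = [0.0] * dim
violations, total_mistakes = 0, 0
for _ in range(50):
    X = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(batch)]
    ranked = sorted(range(batch), key=lambda i: -X[i][0])
    y = [0] * batch
    for i in ranked[:n_pos]:         # assumption: positives are the points with largest x[0]
        y[i] = 1
    r_sq = max(sum(v * v for v in x) for x in X)
    new_w, mistakes = perceptron_k_avg_step(w, X, y, k)
    old_norm = sum(v * v for v in w)
    new_norm = sum(v * v for v in new_w)
    if new_norm > old_norm + 4 * k * r_sq * mistakes + 1e-9:
        violations += 1
    total_mistakes += mistakes
    w = new_w
```

The check requires $k \le n_+$ within each batch, so that a batch with mistakes always contains at least one false negative.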


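Proof of Lemma 19. We prove the result using two cases. For sake of convenience, we will refer to $\hat{\mathbf{y}}_t$ and $\mathbf{y}_t$ as $\hat{\mathbf{y}}$ and $\mathbf{y}$ respectively.

Case 1 ($\Delta_t = 0$): In this case $P_t = P_{t-1}$ since the model is not updated. However, since $\ell^{avg}_{prec@k}(\mathbf{w}) \ge \text{prec@}k(\mathbf{w}) \ge 0$ for all $\mathbf{w} \in \mathcal{W}$ (by Claim 17), we still get

$$P_t \ge P_{t-1} - \ell^{avg}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t),$$

as required.

Case 2 ($\Delta_t > 0$): In this case we use the update to $\mathbf{w}_{t-1}$ to evaluate the update to $P_{t-1}$. For sake of convenience, let us use the notation $s_i = \mathbf{w}^\top \mathbf{x}_i$. Also note that in the PERCEPTRON@K-AVG update, $D_t = 1 - \frac{1}{C(\hat{\mathbf{y}})}$.

$$\begin{aligned}
P_t &= P_{t-1} - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i\, s_i + D_t \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i\, s_i \\
&= P_{t-1} - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i\, s_i + \left(1 - \frac{1}{C(\hat{\mathbf{y}})}\right) \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i\, s_i \\
&= P_{t-1} - \underbrace{\Bigg(\sum_{i \in [b]} (\hat{y}_i - y_i)\, s_i + \frac{1}{C(\hat{\mathbf{y}})} \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i\, s_i\Bigg)}_{(Q)} \\
&\ge P_{t-1} + \Delta_t - \ell^{avg}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t),
\end{aligned}$$

where the last step follows from the definition of $\ell^{avg}_{prec@k}(\cdot)$ which gives us

$$\begin{aligned}
\Delta_t + (Q) &= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i \in [b]} (\hat{y}_i - y_i)\, s_i + \frac{1}{C(\hat{\mathbf{y}})} \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i\, s_i \\
&\le \max_{\|\hat{\mathbf{y}}'\|_1 = k} \Bigg\{\Delta(\mathbf{y}, \hat{\mathbf{y}}') + \sum_{i \in [b]} (\hat{y}'_i - y_i)\, s_i + \frac{1}{C(\hat{\mathbf{y}}')} \sum_{i \in [b]} (1 - \hat{y}'_i)\, y_i\, s_i\Bigg\} \\
&= \ell^{avg}_{prec@k}(s) = \ell^{avg}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t).
\end{aligned}$$

□

This concludes the proof of the mistake bound.

□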


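C.2 Proof of Theorem 9

Theorem 9. Suppose $\|\mathbf{x}_i^t\| \le R$ for all $t, i$. Let $\Delta_T^C = \sum_{t=1}^T \Delta_t$ be the cumulative observed mistake values when the PERCEPTRON@K-MAX algorithm is run. Also, for any predictor $\mathbf{w}$, let $\mathcal{L}_T^{max}(\mathbf{w}) = \sum_{t=1}^T \ell^{max}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t)$. Then we have

$$\Delta_T^C \le \min_{\mathbf{w}} \left(\|\mathbf{w}\| \cdot R \cdot \sqrt{4k} + \sqrt{\mathcal{L}_T^{max}(\mathbf{w})}\right)^2.$$

Proof. As before, we will prove this theorem in two parts. Lemma 18 will continue to hold in this case as well. However, we will need a modified form of Lemma 19 that we prove below. As before, we will use the notation $\hat{\mathbf{y}} = \hat{\mathbf{y}}_t = \mathbf{y}^{(\mathbf{w}_{t-1}, k)}$.

Lemma 20. For any fixed $\mathbf{w} \in \mathcal{W}$, define $P_t := \langle \mathbf{w}_t, \mathbf{w} \rangle$. Then we have

$$P_t \ge P_{t-1} + \Delta_t - \ell^{max}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t).$$

Using Lemmata 18 and 20 the theorem follows as before. All that remains now is to prove Lemma 20.

Proof of Lemma 20. We prove the result using two cases as before. For sake of convenience, we will refer to $\hat{\mathbf{y}}_t$ and $\mathbf{y}_t$ as $\hat{\mathbf{y}}$ and $\mathbf{y}$ respectively.

Case 1 ($\Delta_t = 0$): In this case $P_t = P_{t-1}$ since the model is not updated. However, since $\ell^{max}_{prec@k}(\mathbf{w}) \ge \text{prec@}k(\mathbf{w}) \ge 0$ for all $\mathbf{w} \in \mathcal{W}$ (by the upper bounding property of $\ell^{max}_{prec@k}$), we still get

$$P_t \ge P_{t-1} - \ell^{max}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t),$$

as required.

Case 2 ($\Delta_t > 0$): In this case we use the update to $\mathbf{w}_{t-1}$ to evaluate the update to $P_{t-1}$. For sake of convenience, let us use the notation $s_i = \mathbf{w}^\top \mathbf{x}_i$. Also note that the set $S_t := \text{FN}(\mathbf{w}_{t-1}, \Delta_t)$ contains the false negatives in the top $\Delta_t$ positions as ranked by $\mathbf{w}_{t-1}$.

$$\begin{aligned}
P_t &= P_{t-1} - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i\, s_i + \sum_{i \in S_t} (1 - \hat{y}_i)\, y_i\, s_i \\
&= P_{t-1} - \sum_{i \in [b]} (1 - y_i)\, \hat{y}_i\, s_i + \sum_{i \in [b]} (1 - \hat{y}_i)\, y_i\, s_i - \sum_{i \notin S_t} (1 - \hat{y}_i)\, y_i\, s_i \\
&= P_{t-1} - \sum_{i \in [b]} (\hat{y}_i - y_i)\, s_i - \sum_{i \notin S_t} (1 - \hat{y}_i)\, y_i\, s_i \\
&\ge P_{t-1} - \underbrace{\Bigg(\sum_{i \in [b]} (\hat{y}_i - y_i)\, s_i + \max_{\tilde{\mathbf{y}} \preceq (1 - \hat{\mathbf{y}}) \cdot \mathbf{y},\ \|\tilde{\mathbf{y}}\|_1 = n_+ - k} \sum_{i=1}^n \tilde{y}_i s_i\Bigg)}_{(Q)} \\
&\ge P_{t-1} + \Delta_t - \ell^{max}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t),
\end{aligned}$$

where the fourth step holds since the $n_+ - k$ false negatives outside $S_t$ form a feasible choice of $\tilde{\mathbf{y}}$, and the last step follows from the definition of $\ell^{max}_{prec@k}(\cdot)$ which gives us

$$\begin{aligned}
\Delta_t + (Q) &= \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i \in [b]} (\hat{y}_i - y_i)\, s_i + \max_{\tilde{\mathbf{y}} \preceq (1 - \hat{\mathbf{y}}) \cdot \mathbf{y},\ \|\tilde{\mathbf{y}}\|_1 = n_+ - k} \sum_{i=1}^n \tilde{y}_i s_i \\
&\le \max_{\|\hat{\mathbf{y}}'\|_1 = k} \Bigg\{\Delta(\mathbf{y}, \hat{\mathbf{y}}') + \sum_{i \in [b]} (\hat{y}'_i - y_i)\, s_i + \max_{\tilde{\mathbf{y}} \preceq (1 - \hat{\mathbf{y}}') \cdot \mathbf{y},\ \|\tilde{\mathbf{y}}\|_1 = n_+ - k} \sum_{i=1}^n \tilde{y}_i s_i\Bigg\} \\
&= \ell^{max}_{prec@k}(s) = \ell^{max}_{prec@k}(\mathbf{w}; X_t, \mathbf{y}_t).
\end{aligned}$$

□

This concludes the proof of the theorem.

□

D Proof of Theorem 12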


Our proof of Theorem 12 crucially utilizes the following two lemmas that help in exploiting the structure in our surrogate functions. The first basic lemma states that the pointwise supremum of a set of Lipschitz functions is also Lipschitz.

Lemma 21. Let $f_1, \ldots, f_m$ be $m$ real valued functions $f_i : \mathbb{R}^n \to \mathbb{R}$ such that every $f_i$ is 1-Lipschitz with respect to the $\|\cdot\|_\infty$ norm. Then the function

$$g(\mathbf{v}) = \max_{i \in [m]} f_i(\mathbf{v})$$

is 1-Lipschitz with respect to the $\|\cdot\|_\infty$ norm too.
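Lemma 21 is easy to probe numerically. The sketch below builds a few affine functions $f_i(\mathbf{v}) = \langle \mathbf{a}_i, \mathbf{v} \rangle + c_i$ with $\|\mathbf{a}_i\|_1 = 1$, which makes each of them 1-Lipschitz with respect to the sup norm, and measures the largest observed ratio $|g(\mathbf{u}) - g(\mathbf{v})| / \|\mathbf{u} - \mathbf{v}\|_\infty$ for their pointwise maximum $g$. The choice of affine functions is an assumption made purely for illustration.

```python
import random

rng = random.Random(3)
dim, m = 4, 6


def random_unit_l1(d):
    """Random vector with unit L1 norm, so <a, .> is 1-Lipschitz w.r.t. the sup norm."""
    a = [rng.uniform(-1, 1) for _ in range(d)]
    total = sum(abs(t) for t in a)
    return [t / total for t in a]


funcs = [(random_unit_l1(dim), rng.uniform(-1, 1)) for _ in range(m)]


def g(v):
    """Pointwise maximum of the affine functions."""
    return max(sum(ai * vi for ai, vi in zip(a, v)) + c for a, c in funcs)


worst_ratio = 0.0
for _ in range(500):
    u = [rng.uniform(-3, 3) for _ in range(dim)]
    v = [rng.uniform(-3, 3) for _ in range(dim)]
    dist = max(abs(p - q) for p, q in zip(u, v))
    worst_ratio = max(worst_ratio, abs(g(u) - g(v)) / dist)
```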

The second lemma establishes the convergence of additive estimates over the top of ranked lists. The abstract nature of the result would allow us to apply it to a wide variety of situations and would be crucial to our analyses.

Lemma 22. Let $\mathcal{V}$ be a universe with a total order $\succeq$ established on it and let $v_1, \ldots, v_n$ be a population of $n$ items arranged in decreasing order. Let $\hat{v}_1, \ldots, \hat{v}_b$ be a sample chosen i.i.d. (or without replacement) from the population and arranged in decreasing order as well. Then for any fixed $h : \mathcal{V} \to [-1, 1]$ and $\kappa \in (0, 1]$, we have, with probability at least $1 - \delta$ over the choice of the samples,

$$\left| \frac{1}{\lceil \kappa n \rceil} \sum_{i=1}^{\lceil \kappa n \rceil} h(v_i) - \frac{1}{\lceil \kappa b \rceil} \sum_{i=1}^{\lceil \kappa b \rceil} h(\hat{v}_i) \right| \le O\left(\sqrt{\frac{1}{b}\log\frac{1}{\delta}}\right).$$


Theorem 12. The performance measure prec@$\kappa(\cdot)$, as well as the surrogates $\ell^{ramp}_{prec@\kappa}(\cdot)$, $\ell^{avg}_{prec@\kappa}(\cdot)$ and $\ell^{max}_{prec@\kappa}(\cdot)$, all exhibit uniform convergence at the rate $\alpha(b, \delta) = O\left(\sqrt{\frac{1}{b}\log\frac{1}{\delta}}\right)$.

We will prove the four parts of the theorem in three separate subsections below. We shall consider a population $\mathbf{z}_1, \ldots, \mathbf{z}_n$ and a sample of size $b$, $\hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b$, chosen uniformly at random with (i.e. i.i.d.) or without replacement. We shall let $p$ and $\hat{p}$ denote the fraction of positives in the population and the sample respectively. In the following, we shall reserve the notation $\mathbf{y}$ for the label vector in the sample and shall use the notation $\hat{\mathbf{y}}$ to denote candidate labellings in the definition of the surrogate.


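D.1 A Uniform Convergence Bound for the prec@$\kappa(\cdot)$ Performance Measure

We note that a point-wise convergence result for prec@$\kappa(\cdot)$ follows simply from Lemma 22. To see this, given a population $\mathbf{z}_1, \ldots, \mathbf{z}_n$ and a fixed model $\mathbf{w} \in \mathcal{W}$, construct a parallel population using the transformation $v_i \leftarrow (\mathbf{w}^\top \mathbf{x}_i, y_i) \in \mathbb{R}^2$. We order these tuples according to their first component, i.e. along the scores, and use $h(v_i) = 1 - y_i$. Let the population be arranged such that $v_1 \succeq v_2 \succeq \ldots \succeq v_n$. Then this gives us

$$\frac{1}{\kappa p n} \sum_{i=1}^{\kappa p n} h(v_i) = \text{prec@}\kappa(\mathbf{y}, \mathbf{y}^{(\mathbf{w}, k)}) = \text{prec@}\kappa(\mathbf{w}).$$

Thus, the application of Lemma 22 gives us the following result.

Lemma 23. For any fixed model $\mathbf{w} \in \mathcal{W}$, with probability at least $1 - \delta$ over the choice of $b$ samples, we have

$$\left| \text{prec@}\kappa(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \text{prec@}\kappa(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) \right| \le O\left(\sqrt{\frac{1}{b}\log\frac{1}{\delta}}\right).$$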
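The pointwise convergence statement can be illustrated with a small simulation: fix a scoring rule, compute the normalized prec@$\kappa$ loss on a large synthetic population, and compare it with the value on a uniformly drawn subsample. Everything about the data below (Gaussian score noise, 30% positives, the sizes) is an assumption chosen for illustration; the tolerance is deliberately loose.

```python
import random


def prec_at_kappa_loss(points, kappa):
    """points: list of (score, label). Fraction of negatives among the top kappa * n_+ scores."""
    n_pos = sum(label for _, label in points)
    k = max(1, int(kappa * n_pos))
    top = sorted(points, key=lambda p: -p[0])[:k]
    return sum(1 - label for _, label in top) / k


rng = random.Random(4)
population = []
for _ in range(20000):
    label = 1 if rng.random() < 0.3 else 0
    population.append((label + rng.gauss(0.0, 1.0), label))  # positives score higher on average

sample = rng.sample(population, 4000)  # uniform, without replacement
gap = abs(prec_at_kappa_loss(population, 0.5) - prec_at_kappa_loss(sample, 0.5))
```

Repeating the experiment with larger $b$ should shrink the gap roughly like $1/\sqrt{b}$, in line with the rate in the lemma.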



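To prove the uniform convergence result, we will, in some sense, require a uniform version of Lemma 22. To do so we fix some notation. For any fixed $\kappa > 0$, and for any $\mathbf{w} \in \mathcal{W}$, we will define $v_\mathbf{w}$ as the largest real number $v$ such that

$$\sum_{i=1}^n \mathbb{I}\left[\mathbf{w}^\top \mathbf{x}_i \ge v\right] = \kappa p n.$$

Similarly, we will define $\hat{v}_\mathbf{w}$ as the largest real number $v$ such that

$$\sum_{i=1}^b \mathbb{I}\left[\mathbf{w}^\top \hat{\mathbf{x}}_i \ge v\right] = \kappa \hat{p} b.$$

Using this notation we can redefine prec@$\kappa(\cdot)$ on the population, as well as the sample, as

$$\text{prec@}\kappa(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) := \frac{1}{\kappa p n} \sum_{i=1}^n \mathbb{I}\left[\mathbf{w}^\top \mathbf{x}_i \ge v_\mathbf{w}\right] \cdot \mathbb{I}\left[y_i = 0\right],$$

$$\text{prec@}\kappa(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) := \frac{1}{\kappa \hat{p} b} \sum_{i=1}^b \mathbb{I}\left[\mathbf{w}^\top \hat{\mathbf{x}}_i \ge \hat{v}_\mathbf{w}\right] \cdot \mathbb{I}\left[y_i = 0\right].$$
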
We can now write 


sup |prec@K(w;zi,...,z„) - prec@K(w; Zi,..., zt,)| 
wG VV 


= sup 
wG VV 


< sup 
wG VV 


1 1 

— [w^x > Uw] • I [y^ = 0] - ^ [w^x > Dw] • I [yz = 0] 

^ i—X ^ i—\ 

1 n ^ b 

— [w^x > Uw] • I [yz = 0] - ^ [w^x > Uw] • I [yz = 0] 

^ i—X ^ i—X 


sup 
wG VV 


1 h ^ b 

— [w^x > Uw] • I[yz = 0] - ^ [w^x > Uw] • I[yz = 0] 


< sup 


1 1 
— [w^x >t]-I [yz = 0] - ^ > i] • I [yz = 0] 


Kpb 

1 

Kpb 


2 = 1 
b 


sup 
wG VV 


(^) 

1 b ^ b 

— [w^x > Uw] • I [yz = 0] - ^ [w^x > Uw] • I [yz = 0] 


Kpb 


(B) 


Now, using a standard VC-dimension based uniform convergence argument over the class of thresholded classifiers, 
we get the following result: with probability at least 1 — ^ 


iA)<oiJ^ Mog i + dvc(W) • log 6 j I = (5 | ^ ^ log ^ 


where (ivc(W) is the VC-dimension of the set of classifiers W. Moving on to bound the second term, we can use an 
argument similar to the one used to prove Lemma|^to show that 


{B) < sup 
wG W 


< sup 
WG W 


< sup 
WG W 




-^^ll[wTx>»,] -K 


2=1 

b 


Kpb 


Kpb 

1 

Kpb 


2=1 

b 


1 , ^ 1 

—— I [w^x > Uw] — y^ I [w^x > Uv 

Kun KVTl 

2=1 ^ 2 = 1 
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< o 



where the last step follows from a standard VC-dimension based uniform convergence argument as before. This 
establishes the following uniform convergence result for the prec@k( ) performance measure 

Theorem 24. We have, with probability at least 1 — (5 over the choice ofb samples. 


sup |prec@K(w; zi,..., z„) — prec@K(w; Zi,..., z;,)| < O 

wGW 



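D.2 A Uniform Convergence Bound for the $\ell^{ramp}_{prec@\kappa}(\cdot)$ Surrogate

We first recall the form of the (normalized) surrogate below; note that this is a non-convex surrogate. Also recall that $k = \kappa \cdot n_+(\mathbf{y})$.

$$\ell^{ramp}_{prec@\kappa}(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) = \underbrace{\max_{\|\hat{\mathbf{y}}\|_1 = k} \left\{ \frac{\Delta(\mathbf{y}, \hat{\mathbf{y}})}{k} + \frac{1}{k} \sum_{i=1}^n \hat{y}_i\, \mathbf{w}^\top \mathbf{x}_i \right\}}_{\Psi_1(\mathbf{w};\ \mathbf{z}_1, \ldots, \mathbf{z}_n)} - \underbrace{\max_{\|\tilde{\mathbf{y}}\|_1 = k,\ K(\mathbf{y}, \tilde{\mathbf{y}}) = k} \frac{1}{k} \sum_{i=1}^n \tilde{y}_i\, \mathbf{w}^\top \mathbf{x}_i}_{\Psi_2(\mathbf{w};\ \mathbf{z}_1, \ldots, \mathbf{z}_n)}$$

We will now show that both the functions $\Psi_1(\cdot)$ as well as $\Psi_2(\cdot)$ exhibit uniform convergence. This shall suffice to prove that $\ell^{ramp}_{prec@\kappa}(\cdot)$ exhibits uniform convergence. To do so we shall show that the two functions exhibit pointwise convergence and that they are Lipschitz. This will allow a standard $L_\infty$ covering number argument [Zhang, 2002] to give us the required uniform convergence results.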


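D.2.1 A Uniform Convergence Result for $\Psi_1(\cdot)$

We have

$$\Psi_1(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) = \max_{\|\hat{\mathbf{y}}\|_1 = \kappa p n} \left\{ \frac{1}{\kappa p n} \sum_{i=1}^n \hat{y}_i (\mathbf{w}^\top \mathbf{x}_i - y_i) \right\} + 1,$$

$$\Psi_1(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) = \max_{\|\hat{\mathbf{y}}\|_1 = \kappa \hat{p} b} \left\{ \frac{1}{\kappa \hat{p} b} \sum_{i=1}^b \hat{y}_i (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right\} + 1.$$

An application of Corollary 29 indicates that $\Psi_1(\cdot)$ is Lipschitz i.e.

$$\left| \Psi_1(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \Psi_1(\mathbf{w}'; \mathbf{z}_1, \ldots, \mathbf{z}_n) \right| \le O\left(\|\mathbf{w} - \mathbf{w}'\|_2\right).$$

Thus, all that remains is to prove pointwise convergence. We decompose the error as follows

$$\begin{aligned}
\left| \Psi_1(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \Psi_1(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) \right| &\le \underbrace{\left| \Psi_1(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \max_{\|\hat{\mathbf{y}}\|_1 = \kappa p b} \left\{ \frac{1}{\kappa p b} \sum_{i=1}^b \hat{y}_i (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right\} - 1 \right|}_{(A)} \\
&\quad + \underbrace{\left| \max_{\|\hat{\mathbf{y}}\|_1 = \kappa p b} \left\{ \frac{1}{\kappa p b} \sum_{i=1}^b \hat{y}_i (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right\} + 1 - \Psi_1(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) \right|}_{(B)}.
\end{aligned}$$

An application of Lemma 22 using $v_i = \mathbf{w}^\top \mathbf{x}_i - y_i$ and $h(\cdot)$ as the identity function shows us that

$$(A) \le O\left(\sqrt{\frac{1}{b}\log\frac{1}{\delta}}\right).$$

To bound the residual term $(B)$, notice that an application of Hoeffding's inequality tells us that with probability at least $1 - \delta$

$$|p - \hat{p}| \le \sqrt{\frac{1}{2b}\log\frac{2}{\delta}},$$

which lets us bound the residual as follows. Assume, for sake of simplicity, that the sample data points have been ordered in decreasing order of the quantity $\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i$, as well as that $|\mathbf{w}^\top \mathbf{x}| \le 1$ for all $\mathbf{x}$.

$$\begin{aligned}
(B) &= \left| \max_{\|\hat{\mathbf{y}}\|_1 = \kappa p b} \left\{ \frac{1}{\kappa p b} \sum_{i=1}^b \hat{y}_i (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right\} - \max_{\|\hat{\mathbf{y}}\|_1 = \kappa \hat{p} b} \left\{ \frac{1}{\kappa \hat{p} b} \sum_{i=1}^b \hat{y}_i (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right\} \right| \\
&= \left| \frac{1}{\kappa p b} \sum_{i=1}^{\kappa p b} (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) - \frac{1}{\kappa \hat{p} b} \sum_{i=1}^{\kappa \hat{p} b} (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right| \\
&\le \left| \frac{1}{\kappa p b} - \frac{1}{\kappa \hat{p} b} \right| \left| \sum_{i=1}^{\kappa \min\{p, \hat{p}\} b} (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right| + \frac{1}{\kappa \max\{p, \hat{p}\} b} \left| \sum_{i = \kappa \min\{p, \hat{p}\} b + 1}^{\kappa \max\{p, \hat{p}\} b} (\mathbf{w}^\top \hat{\mathbf{x}}_i - y_i) \right| \\
&\le \frac{2}{\kappa b} \cdot \frac{|p - \hat{p}|}{p \hat{p}} \cdot \kappa \min\{p, \hat{p}\}\, b + \frac{2}{\kappa \max\{p, \hat{p}\}\, b} \cdot \kappa\, |p - \hat{p}|\, b \\
&= 2\,|p - \hat{p}| \left( \frac{\min\{p, \hat{p}\}}{p \hat{p}} + \frac{1}{\max\{p, \hat{p}\}} \right) = \frac{4\,|p - \hat{p}|}{\max\{p, \hat{p}\}} \\
&\le \frac{4}{p} \sqrt{\frac{1}{2b}\log\frac{2}{\delta}}.
\end{aligned}$$

This establishes that for any fixed $\mathbf{w} \in \mathcal{W}$, with probability at least $1 - \delta$, we have

$$\left| \Psi_1(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \Psi_1(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) \right| \le O\left(\sqrt{\frac{1}{b}\log\frac{1}{\delta}}\right),$$

which concludes the uniform convergence proof.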


D.2.2 A Uniform Convergence Result for $\Psi_2(\cdot)$

The proof follows similarly here with a direct application of Corollary 29 showing us that $\Psi_2(\cdot)$ is Lipschitz and an application of Lemma 22 along with the observation that $|p - \hat{p}| \le \sqrt{\frac{1}{2b}\log\frac{2}{\delta}}$, similar to the discussion used above, concluding the point-wise convergence proof.

The above two part argument establishes the following uniform convergence result for the $\ell^{ramp}_{prec@\kappa}(\cdot)$ performance measure.

Theorem 25. We have, with probability at least $1 - \delta$ over the choice of $b$ samples,

$$\sup_{\mathbf{w} \in \mathcal{W}} \left| \ell^{ramp}_{prec@\kappa}(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \ell^{ramp}_{prec@\kappa}(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) \right| \le O\left(\sqrt{\frac{1}{b}\log\frac{1}{\delta}}\right).$$


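D.3 A Uniform Convergence Bound for the $\ell^{avg}_{prec@\kappa}(\cdot)$ Surrogate

This will be the most involved of the four bounds, given the intricate nature of the surrogate. We will prove this result using a series of partial results which we state below. As before, for any $\mathbf{w} \in \mathcal{W}$ and any $\hat{\mathbf{y}}$, we define

$$\Delta(\mathbf{w}, \hat{\mathbf{y}}) := \frac{1}{\kappa p n} \left( \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n (\hat{y}_i - y_i)\, \mathbf{w}^\top \mathbf{x}_i + \frac{1}{C(\hat{\mathbf{y}})} \sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, \mathbf{w}^\top \mathbf{x}_i \right),$$

$$\hat{\Delta}(\mathbf{w}, \hat{\mathbf{y}}) := \frac{1}{\kappa \hat{p} b} \left( \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^b (\hat{y}_i - y_i)\, \mathbf{w}^\top \hat{\mathbf{x}}_i + \frac{1}{C(\hat{\mathbf{y}})} \sum_{i=1}^b (1 - \hat{y}_i)\, y_i\, \mathbf{w}^\top \hat{\mathbf{x}}_i \right).$$

Recall that we are using $\mathbf{y}$ to denote the true labels of the sample points and $\hat{\mathbf{y}}$ to denote the candidate labellings while defining the surrogates. We also define, for any $\beta \in [0, 1]$, the following quantities

$$\Delta(\mathbf{w}, \beta) := \max_{\|\hat{\mathbf{y}}\|_1 = \kappa p n,\ K(\mathbf{y}, \hat{\mathbf{y}}) = \beta p n} \Delta(\mathbf{w}, \hat{\mathbf{y}}), \qquad \hat{\Delta}(\mathbf{w}, \beta) := \max_{\|\hat{\mathbf{y}}\|_1 = \kappa \hat{p} b,\ K(\mathbf{y}, \hat{\mathbf{y}}) = \beta \hat{p} b} \hat{\Delta}(\mathbf{w}, \hat{\mathbf{y}}).$$

Note that $\beta$ denotes a target true positive rate and consequently, can only take values between 0 and $\kappa$. Given the above, we claim the following lemmata.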

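Lemma 26. For every $\mathbf{w}$ and any $\beta, \beta' \in [0, \kappa]$, we have

$$\left| \Delta(\mathbf{w}, \beta) - \Delta(\mathbf{w}, \beta') \right| \le O\left(|\beta - \beta'|\right).$$

Lemma 27. For any fixed $\beta$, we have, with probability at least $1 - \delta$ over the choice of the sample,

$$\sup_{\mathbf{w} \in \mathcal{W}} \left| \Delta(\mathbf{w}, \beta) - \hat{\Delta}(\mathbf{w}, \beta) \right| \le O\left(\sqrt{\frac{1}{b}\log\frac{1}{\delta}}\right).$$

Using the above two lemmata as given, we can now prove the desired uniform convergence result for the $\ell^{avg}_{prec@\kappa}(\cdot)$ surrogate:

Theorem 28. With probability at least $1 - \delta$ over the choice of the samples, we have

$$\sup_{\mathbf{w} \in \mathcal{W}} \left| \ell^{avg}_{prec@\kappa}(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \ell^{avg}_{prec@\kappa}(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) \right| \le O\left(\sqrt{\frac{1}{b}\log\frac{b}{\delta}}\right).$$
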
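Proof. We note that given the definitions of $\Delta(\mathbf{w}, \beta)$ and $\hat{\Delta}(\mathbf{w}, \beta)$, we can redefine the performance measure as follows

$$\ell^{avg}_{prec@\kappa}(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) = \max_{\beta \in [0, \kappa]} \Delta(\mathbf{w}, \beta).$$

We now note that for the population, the set of achievable values of true positive rates i.e. $\beta$ is

$$B = \left\{ 0, \frac{1}{\kappa p n}, \frac{2}{\kappa p n}, \ldots, \frac{\kappa p n - 1}{\kappa p n}, 1 \right\},$$

which correspond, respectively, to classifiers for which the number of true positives equals $\{0, 1, 2, \ldots, \kappa p n - 1, \kappa p n\}$. Similarly, the set of achievable values of true positive rates i.e. $\beta$ for the sample is

$$\hat{B} = \left\{ 0, \frac{1}{\kappa \hat{p} b}, \frac{2}{\kappa \hat{p} b}, \ldots, \frac{\kappa \hat{p} b - 1}{\kappa \hat{p} b}, 1 \right\}.$$

Clearly, for any $\beta \in B$, there exists a $\pi_{\hat{B}}(\beta) \in \hat{B}$ such that

$$\left| \pi_{\hat{B}}(\beta) - \beta \right| \le \frac{1}{\kappa \hat{p} b}.$$

Given this, let us define

$$\beta^*(\mathbf{w}) = \arg\max_{\beta \in [0, \kappa]} \Delta(\mathbf{w}, \beta), \qquad \hat{\beta}^*(\mathbf{w}) = \arg\max_{\beta \in [0, \kappa]} \hat{\Delta}(\mathbf{w}, \beta).$$
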
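We shall assume, for the sake of simplicity, that $b \mid n$ so that $\hat{B} \subseteq B$. Writing $\pi^* = \pi_{\hat{B}}(\beta^*(\mathbf{w}))$, this gives us the following set of inequalities for any $\mathbf{w} \in \mathcal{W}$:

$$\begin{aligned}
\Delta(\mathbf{w}, \beta^*(\mathbf{w})) &\le \Delta(\mathbf{w}, \pi^*) + O\left(\left|\beta^*(\mathbf{w}) - \pi^*\right|\right) \\
&\le \hat{\Delta}(\mathbf{w}, \pi^*) + \sup_{\mathbf{w} \in \mathcal{W}} \left| \Delta(\mathbf{w}, \pi^*) - \hat{\Delta}(\mathbf{w}, \pi^*) \right| + O\left(\frac{1}{\kappa \hat{p} b}\right) \\
&\le \hat{\Delta}(\mathbf{w}, \pi^*) + \sup_{\mathbf{w} \in \mathcal{W},\ \beta \in \hat{B}} \left| \Delta(\mathbf{w}, \beta) - \hat{\Delta}(\mathbf{w}, \beta) \right| + O\left(\frac{1}{\kappa \hat{p} b}\right) \\
&\le \hat{\Delta}(\mathbf{w}, \pi^*) + O\left(\sqrt{\frac{1}{b}\log\frac{b}{\delta}}\right) + O\left(\frac{1}{\kappa \hat{p} b}\right) \\
&\le \hat{\Delta}(\mathbf{w}, \hat{\beta}^*(\mathbf{w})) + O\left(\sqrt{\frac{1}{b}\log\frac{b}{\delta}}\right) + O\left(\frac{1}{\kappa \hat{p} b}\right),
\end{aligned}$$

where the first step follows from Lemma 26, the third step follows since $\pi_{\hat{B}}(\beta^*(\mathbf{w})) \in \hat{B}$, the fourth step follows from an application of the union bound with Lemma 27 over the set of elements in $\hat{B}$ and noting $|\hat{B}| \le O(b)$, and the last step follows from the optimality of $\hat{\beta}^*(\mathbf{w})$. Similarly we can write, for any $\mathbf{w} \in \mathcal{W}$,

$$\hat{\Delta}(\mathbf{w}, \hat{\beta}^*(\mathbf{w})) \le \Delta(\mathbf{w}, \hat{\beta}^*(\mathbf{w})) + O\left(\sqrt{\frac{1}{b}\log\frac{b}{\delta}}\right) \le \Delta(\mathbf{w}, \beta^*(\mathbf{w})) + O\left(\sqrt{\frac{1}{b}\log\frac{b}{\delta}}\right),$$

where the first step uses Lemma 27 with a union bound over elements in $\hat{B}$ and the fact that $\hat{\beta}^*(\mathbf{w}) \in \hat{B} \subseteq B$ (note that this assumption is not crucial to the argument: indeed, even if $\hat{\beta}^*(\mathbf{w}) \notin B$, we would only incur an extra $O\left(\frac{1}{n}\right)$ error by an application of Lemma 26 since given the granularity of $B$, we would always be able to find a value in $B$ that is no more than $O\left(\frac{1}{n}\right)$ far from $\hat{\beta}^*(\mathbf{w})$), and the last step uses the optimality of $\beta^*(\mathbf{w})$. Thus, we can write

$$\begin{aligned}
\sup_{\mathbf{w} \in \mathcal{W}} \left| \ell^{avg}_{prec@\kappa}(\mathbf{w}; \mathbf{z}_1, \ldots, \mathbf{z}_n) - \ell^{avg}_{prec@\kappa}(\mathbf{w}; \hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_b) \right| &= \sup_{\mathbf{w} \in \mathcal{W}} \left| \Delta(\mathbf{w}, \beta^*(\mathbf{w})) - \hat{\Delta}(\mathbf{w}, \hat{\beta}^*(\mathbf{w})) \right| \\
&\le O\left(\sqrt{\frac{1}{b}\log\frac{b}{\delta}}\right) + O\left(\frac{1}{\kappa \hat{p} b}\right) \\
&\le O\left(\sqrt{\frac{1}{b}\log\frac{b}{\delta}}\right),
\end{aligned}$$

since $\hat{p} \ge \Omega(1)$ with probability at least $1 - \delta$. Thus, all we are left is to prove Lemmata 26 and 27 which we do below. To proceed with the proofs, we first write the form of $\Delta(\mathbf{w}, \beta)$ for a fixed $\mathbf{w}$ and $\beta$ and simplify the expression for ease of further analysis. We shall assume, for sake of simplicity, that $\beta p n$, $\kappa p n$, $\beta \hat{p} b$, and $\kappa \hat{p} b$ are all integers.

$$\begin{aligned}
\Delta(\mathbf{w}, \beta) &= \max_{\|\hat{\mathbf{y}}\|_1 = \kappa p n,\ K(\mathbf{y}, \hat{\mathbf{y}}) = \beta p n} \frac{1}{\kappa p n} \left\{ \Delta(\mathbf{y}, \hat{\mathbf{y}}) + \sum_{i=1}^n (\hat{y}_i - y_i)\, \mathbf{w}^\top \mathbf{x}_i + \frac{1}{C(\hat{\mathbf{y}})} \sum_{i=1}^n (1 - \hat{y}_i)\, y_i\, \mathbf{w}^\top \mathbf{x}_i \right\} \\
&= 1 - \frac{\beta}{\kappa} - \underbrace{\frac{1}{\kappa p n} \left( \frac{\kappa - \beta}{1 - \beta} \right) \sum_{i=1}^n y_i\, \mathbf{w}^\top \mathbf{x}_i}_{A(\mathbf{w}, \beta)} + \underbrace{\max_{\|\hat{\mathbf{y}}\|_1 = \kappa p n,\ K(\mathbf{y}, \hat{\mathbf{y}}) = \beta p n} \frac{1}{\kappa p n} \sum_{i=1}^n \hat{y}_i \left( 1 - \frac{1 - \kappa}{1 - \beta}\, y_i \right) \mathbf{w}^\top \mathbf{x}_i}_{B(\mathbf{w}, \beta)}.
\end{aligned}$$

We can similarly define $\hat{A}(\mathbf{w}, \beta)$ and $\hat{B}(\mathbf{w}, \beta)$ for the samples.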

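Proof of Lemma 26. We have, by the above simplification,

$$|\Delta(w,\beta) - \Delta(w,\beta')| \le \frac{1}{\kappa}|\beta-\beta'| + |A(w,\beta) - A(w,\beta')| + |B(w,\beta) - B(w,\beta')|,$$

as well as, assuming without loss of generality, that $|w^\top x| \le 1$ for all $w$ and $x$,

$$|A(w,\beta) - A(w,\beta')| \le \left|\frac{\kappa-\beta}{1-\beta} - \frac{\kappa-\beta'}{1-\beta'}\right| \cdot \frac{1}{\kappa p n}\sum_{i=1}^{n} y_i\, |w^\top x_i| \le \frac{(1-\kappa)\,|\beta-\beta'|}{\kappa(1-\beta)(1-\beta')} \le \frac{1}{\kappa(1-\kappa)}|\beta-\beta'|,$$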


where the last step follows since $\beta, \beta' \le \kappa$. To analyze the third term, i.e. $|B(w,\beta) - B(w,\beta')|$, we analyze the nature of the assignment $\hat y$ which defines $B(w, \beta)$. Clearly $\hat y$ must assign $\beta p n$ positives and $(\kappa - \beta) p n$ negatives a label of 1 and the rest a label of 0. Since it is supposed to maximize the scores thus obtained, it clearly assigns the top ranked $(\kappa - \beta) p n$ negatives a label of 1. As far as positives are concerned, since $\beta \le \kappa$, we have $1 - \frac{1-\kappa}{1-\beta} \ge 0$, which means that the $\beta p n$ top ranked positives will get assigned a label of 1.

To formalize this, let us set some notation. Let $s_1^+ \ge s_2^+ \ge \ldots \ge s_{pn}^+$ denote the scores of the positive points arranged in descending order. Similarly, let $s_1^- \ge s_2^- \ge \ldots \ge s_{(1-p)n}^-$ denote the scores of the negative points arranged in descending order. Given this notation, we can rewrite $B(w, \beta)$ as follows:

$$B(w,\beta) = \frac{1}{\kappa p n}\left( \left(\frac{\kappa-\beta}{1-\beta}\right)\sum_{i=1}^{\beta p n} s_i^+ + \sum_{i=1}^{(\kappa-\beta) p n} s_i^- \right).$$

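Thus, assuming without loss of generality that $|s_i^+|, |s_i^-| \le 1$, we have,

$$\begin{aligned}
|B(w,\beta) - B(w,\beta')| &= \frac{1}{\kappa p n}\left| \left(\frac{\kappa-\beta}{1-\beta}\right)\sum_{i=1}^{\beta p n} s_i^+ - \left(\frac{\kappa-\beta'}{1-\beta'}\right)\sum_{i=1}^{\beta' p n} s_i^+ + \sum_{i=1}^{(\kappa-\beta) p n} s_i^- - \sum_{i=1}^{(\kappa-\beta') p n} s_i^- \right| \\
&\le \frac{1}{\kappa p n}\left| \left(\frac{\kappa-\beta}{1-\beta}\right)\sum_{i=1}^{\beta p n} s_i^+ - \left(\frac{\kappa-\beta'}{1-\beta'}\right)\sum_{i=1}^{\beta' p n} s_i^+ \right| + \frac{1}{\kappa p n}\left| \sum_{i=1}^{(\kappa-\beta) p n} s_i^- - \sum_{i=1}^{(\kappa-\beta') p n} s_i^- \right| \\
&\le \frac{1}{\kappa p n}\left|\frac{\kappa-\beta}{1-\beta} - \frac{\kappa-\beta'}{1-\beta'}\right| \sum_{i=1}^{\min\{\beta,\beta'\} p n} |s_i^+| + \frac{1}{\kappa p n}\cdot\frac{\kappa - \max\{\beta,\beta'\}}{1 - \max\{\beta,\beta'\}}\,|\beta-\beta'|\, p n + \frac{|\beta-\beta'|\, p n}{\kappa p n} \\
&\le \frac{1}{\kappa p n}\cdot\frac{(1-\kappa)\,|\beta-\beta'|}{(1-\beta)(1-\beta')}\,\min\{\beta,\beta'\}\, p n + \frac{1}{\kappa}\cdot\frac{\kappa - \max\{\beta,\beta'\}}{1 - \max\{\beta,\beta'\}}\,|\beta-\beta'| + \frac{|\beta-\beta'|}{\kappa} \\
&\le \left(\frac{1}{1-\kappa} + \frac{2}{\kappa}\right)|\beta-\beta'|,
\end{aligned}$$

where the last step uses the fact that $0 \le \beta, \beta' \le \kappa$. This tells us that

$$|\Delta(w,\beta) - \Delta(w,\beta')| \le \frac{5}{\kappa(1-\kappa)}\,|\beta-\beta'|,$$

which finishes the proof. □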
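The Lipschitz behaviour in $\beta$ can also be probed numerically. The sketch below is only an illustration under stated assumptions: it evaluates the closed form $1 - \beta/\kappa - A + B$ on the grid $\beta \in \{0, 1/pn, \ldots, \kappa\}$ for random scores in $[-1, 1]$ and reports the largest observed slope, which can be compared against the constant $5/(\kappa(1-\kappa))$. The helper names and the random data are hypothetical.

```python
import random


def delta(scores_pos, scores_neg, kappa, tp):
    """Closed form of Delta(w, beta) for beta = tp / (number of positives)."""
    n_pos = len(scores_pos)
    k = round(kappa * n_pos)
    beta = tp / n_pos
    pos = sorted(scores_pos, reverse=True)
    neg = sorted(scores_neg, reverse=True)
    coef = (kappa - beta) / (1 - beta)
    a_term = coef * sum(pos) / k
    b_term = (coef * sum(pos[:tp]) + sum(neg[: k - tp])) / k
    return 1 - beta / kappa - a_term + b_term


def max_slope(scores_pos, scores_neg, kappa):
    """Largest |Delta(beta) - Delta(beta')| / |beta - beta'| over the beta grid."""
    n_pos = len(scores_pos)
    k = round(kappa * n_pos)
    values = [delta(scores_pos, scores_neg, kappa, tp) for tp in range(k + 1)]
    worst = 0.0
    for a in range(k + 1):
        for c in range(a + 1, k + 1):
            gap = abs(values[a] - values[c])
            worst = max(worst, gap * n_pos / (c - a))  # |beta - beta'| = (c - a) / n_pos
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    kappa = 0.5
    pos = [rng.uniform(-1, 1) for _ in range(20)]
    neg = [rng.uniform(-1, 1) for _ in range(30)]
    print("observed slope:", max_slope(pos, neg, kappa))
    print("bound         :", 5 / (kappa * (1 - kappa)))
```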
Proof of Lemma 27. We will prove the lemma by showing that the terms $A(w,\beta)$ and $B(w,\beta)$ exhibit uniform convergence.

It is easy to see that $A(w, \beta)$ exhibits uniform convergence since it is a simple average of population scores. The only thing to be taken care of is that $\hat A(w, \beta)$ contains $\hat p$ in the normalization whereas $A(w, \beta)$ contains $p$. However, since $p$ and $\hat p$ are very close with high probability, an argument similar to the one used in the proof of Theorem 25 can be used to conclude that with probability at least $1 - \delta$, we have


sup 

wGW 


A{w,P) - A{w,I3) 


< O 



To prove uniform convergence for $B(w, \beta)$ we will use our earlier method of showing that this function exhibits pointwise convergence and that this function is Lipschitz with respect to $w$. The Lipschitz property of $B(w,\beta)$ is evident from an application of Corollary 29. To analyze its pointwise convergence property, recall that the function $B(w,\beta)$, as analyzed in the proof of Lemma 26, is composed by sorting the positives and negatives separately, taking the top few positions in each list, and adding the scores present therein. This allows an application of Lemma 22, as used in the proof of Theorem 25, separately to the positive and negative lists, to conclude the pointwise convergence bound for $B(w, \beta)$. □

This concludes the proof of the uniform convergence bound for $\ell^{\text{avg}}_{\text{prec}@\kappa}(\cdot)$. □


D.4 Proof of Lemma 21

Lemma 21. Let $f_1, \ldots, f_m$ be $m$ real valued functions $f_i : \mathbb{R}^n \to \mathbb{R}$ such that every $f_i$ is 1-Lipschitz with respect to the $\|\cdot\|_\infty$ norm. Then the function

$$g(v) = \max_{i \in [m]} f_i(v)$$

is 1-Lipschitz with respect to the $\|\cdot\|_\infty$ norm too.

Proof. Fix $v, v' \in \mathbb{R}^n$. The premise guarantees us that for any $i \in [m]$, we have

$$|f_i(v) - f_i(v')| \le \|v - v'\|_\infty.$$

Now let $g(v) = f_i(v)$ and $g(v') = f_j(v')$. Then we have

$$g(v) - g(v') = f_i(v) - f_j(v') \le f_i(v) - f_i(v') \le \|v - v'\|_\infty,$$

since $f_j(v') \ge f_i(v')$. Similarly we have $g(v') - g(v) \le \|v - v'\|_\infty$. This completes the proof. □
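The statement of Lemma 21 is easy to probe empirically. The following Python sketch is an illustration, not part of the proof: it builds a few functions that are 1-Lipschitz in the sup norm (averages over coordinate subsets, chosen here purely as a convenient example family), takes their pointwise maximum, and counts random pairs that violate the Lipschitz inequality. All names are hypothetical.

```python
import random


def linf(u, v):
    """Sup-norm distance between two vectors of equal length."""
    return max(abs(a - b) for a, b in zip(u, v))


def make_coordinate_average(idx):
    """Averaging a subset of coordinates is 1-Lipschitz w.r.t. the sup norm."""
    return lambda v: sum(v[i] for i in idx) / len(idx)


def pointwise_max(fs):
    """g(v) = max_i f_i(v)."""
    return lambda v: max(f(v) for f in fs)


def lipschitz_violations(g, dim, trials, rng):
    """Count random pairs (u, v) with |g(u) - g(v)| > ||u - v||_inf."""
    bad = 0
    for _ in range(trials):
        u = [rng.uniform(-1, 1) for _ in range(dim)]
        v = [rng.uniform(-1, 1) for _ in range(dim)]
        if abs(g(u) - g(v)) > linf(u, v) + 1e-12:
            bad += 1
    return bad


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 6
    fs = [make_coordinate_average(rng.sample(range(dim), 3)) for _ in range(5)]
    g = pointwise_max(fs)
    print("violations:", lipschitz_violations(g, dim, 1000, rng))
```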

The following corollary would be most useful in our subsequent analyses. 

Corollary 29. Let $\Psi : W \to \mathbb{R}$ be a function defined as follows

$$\Psi(w) = \max_{\substack{y \in \{0,1\}^n \\ \|y\|_1 = k}} \frac{1}{k}\sum_{i=1}^{n} y_i\,(w^\top x_i - c_i),$$

where $c_i$ are constants independent of $w$ and we assume without loss of generality that $\|x_i\|_2 \le 1$ for all $i$. Then $\Psi(\cdot)$ is 1-Lipschitz with respect to the $L_2$ norm, i.e. for all $w, w' \in W$

$$|\Psi(w) - \Psi(w')| \le \|w - w'\|_2.$$

Proof. Note that for any $y$ such that $\|y\|_1 = k$, the function $f_y(v) = \frac{1}{k}\sum_{i=1}^{n} y_i\,(v_i - c_i)$ is 1-Lipschitz with respect to the $\|\cdot\|_\infty$ norm. Thus if we define

$$\Phi(v) = \max_{\|y\|_1 = k} f_y(v),$$

then an application of Lemma 21 tells us that $\Phi(\cdot)$ is 1-Lipschitz with respect to the $\|\cdot\|_\infty$ norm as well. Also note that if we define

$$v(w) = (w^\top x_1 - c_1, \ldots, w^\top x_n - c_n),$$

then we have

$$\Psi(w) = \Phi(v(w)).$$

We now note that by an application of the Cauchy-Schwarz inequality, and the fact that $\|x_i\|_2 \le 1$ for all $i$, we have

$$\|v(w) - v(w')\|_\infty \le \|w - w'\|_2.$$

Thus we have

$$|\Psi(w) - \Psi(w')| = |\Phi(v(w)) - \Phi(v(w'))| \le \|v(w) - v(w')\|_\infty \le \|w - w'\|_2,$$

which gives us the desired result. □
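Since the maximum over $y$ with $\|y\|_1 = k$ simply selects the $k$ largest values of $w^\top x_i - c_i$, $\Psi(w)$ can be computed by sorting. The sketch below, with hypothetical names and random data, uses that observation to check the inequality $|\Psi(w) - \Psi(w')| \le \|w - w'\|_2$ on random pairs, assuming the $x_i$ are rescaled to lie in the unit $L_2$ ball.

```python
import math
import random


def psi(w, xs, cs, k):
    """(1/k) * sum of the k largest values of w.x_i - c_i."""
    margins = sorted(
        (sum(wj * xj for wj, xj in zip(w, x)) - c for x, c in zip(xs, cs)),
        reverse=True,
    )
    return sum(margins[:k]) / k


def unit_ball_point(dim, rng):
    """Random vector rescaled so that its L2 norm is at most 1."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    scale = rng.random() / norm
    return [a * scale for a in v]


def worst_ratio(dim, n, k, trials, seed=0):
    """Largest observed |psi(w) - psi(w')| / ||w - w'||_2 over random pairs."""
    rng = random.Random(seed)
    xs = [unit_ball_point(dim, rng) for _ in range(n)]
    cs = [rng.uniform(-1, 1) for _ in range(n)]
    worst = 0.0
    for _ in range(trials):
        w1 = [rng.uniform(-1, 1) for _ in range(dim)]
        w2 = [rng.uniform(-1, 1) for _ in range(dim)]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))
        if dist > 0:
            worst = max(worst, abs(psi(w1, xs, cs, k) - psi(w2, xs, cs, k)) / dist)
    return worst


if __name__ == "__main__":
    print("worst ratio:", worst_ratio(dim=5, n=30, k=6, trials=1000))
```

The printed ratio should never exceed 1 if the corollary holds; a value well below 1 is typical because random directions rarely align with the worst case.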


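D.5 Proof of Lemma 22

Lemma 22. Let $V$ be a universe with a total order $\succeq$ established on it and let $v_1, \ldots, v_n$ be a population of $n$ items arranged in decreasing order. Let $\hat v_1, \ldots, \hat v_b$ be a sample chosen i.i.d. (or without replacement) from the population and arranged in decreasing order as well. Then for any fixed $h : V \to [-1,1]$ and $\kappa \in (0,1]$, we have, with probability at least $1 - \delta$ over the choice of the samples,

$$\left| \frac{1}{\lceil \kappa n \rceil}\sum_{i=1}^{\lceil \kappa n \rceil} h(v_i) - \frac{1}{\lceil \kappa b \rceil}\sum_{i=1}^{\lceil \kappa b \rceil} h(\hat v_i) \right| \le 4\sqrt{\frac{\log\frac{2}{\delta}}{\kappa b}}.$$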

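Proof. We will assume, for sake of simplicity, that $\kappa n$ and $\kappa b$ are both integers so that there are no rounding off issues. Let $v^* := v_{\kappa n}$ and $\hat v^* := \hat v_{\kappa b}$ denote the elements at the bottom of the $\kappa$-th fraction of the top in the sorted population and sample lists (recall that the population and the sample lists are sorted in descending order). Also let $T(v) := \mathbb{I}[v \succeq v^*]$ and $\hat T(v) := \mathbb{I}[v \succeq \hat v^*]$ (note that $\mathbb{I}[E]$ is the indicator variable for the event $E$) so that we have

$$\begin{aligned}
\left| \frac{1}{\kappa n}\sum_{i=1}^{\kappa n} h(v_i) - \frac{1}{\kappa b}\sum_{i=1}^{\kappa b} h(\hat v_i) \right|
&= \left| \frac{1}{\kappa n}\sum_{i=1}^{n} T(v_i)\cdot h(v_i) - \frac{1}{\kappa b}\sum_{i=1}^{b} \hat T(\hat v_i)\cdot h(\hat v_i) \right| \\
&\le \left| \frac{1}{\kappa n}\sum_{i=1}^{n} T(v_i)\cdot h(v_i) - \frac{1}{\kappa b}\sum_{i=1}^{b} T(\hat v_i)\cdot h(\hat v_i) \right| + \frac{1}{\kappa b}\left| \sum_{i=1}^{b} \left(T(\hat v_i) - \hat T(\hat v_i)\right)\cdot h(\hat v_i) \right| \\
&\le 2\sqrt{\frac{\log\frac{2}{\delta}}{\kappa b}} + \frac{1}{\kappa b}\underbrace{\left| \sum_{i=1}^{b} \left(T(\hat v_i) - \hat T(\hat v_i)\right)\cdot h(\hat v_i) \right|}_{(A)},
\end{aligned}$$

where the third step follows from Bernstein's inequality (which holds in situations with sampling without replacement as well Boucheron et al. [2004]) since $|T(v)\cdot h(v)| \le 1$ for all $v$ and we have assumed $b \ge \frac{1}{\kappa}\log\frac{2}{\delta}$. Now if $v^* \succeq \hat v^*$ then we have $\hat T(v) \ge T(v)$ for all $v$. On the other hand if $\hat v^* \succeq v^*$, then we have $\hat T(v) \le T(v)$ for all $v$. This means that since $|h(v)| \le 1$ for all $v$, we have

$$(A) \le \left| \sum_{i=1}^{b} \left(\hat T(\hat v_i) - T(\hat v_i)\right) \right| = \kappa b\,\left| \frac{1}{\kappa b}\sum_{i=1}^{b} T(\hat v_i) - 1 \right| \le 2\sqrt{\kappa b \log\frac{2}{\delta}},$$

where the second step follows since $\frac{1}{\kappa b}\sum_{i=1}^{b} \hat T(\hat v_i) = 1$ by definition and the last step follows from another application of Bernstein's inequality. This completes the proof. □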
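The quantity controlled by Lemma 22 can be simulated directly. The following Python sketch is a rough illustration under assumed toy settings (uniform scores, $h$ taken to be the identity on $[-1, 1]$): it measures how far the average over the top $\kappa$ fraction of a random sample, drawn without replacement, deviates from the same average over the whole population. The function names are hypothetical and no particular constant in the bound is assumed.

```python
import random


def top_fraction_mean(values, kappa, h=lambda v: v):
    """Average of h over the top kappa-fraction of `values` (sorted descending)."""
    ordered = sorted(values, reverse=True)
    m = max(1, round(kappa * len(ordered)))
    return sum(h(v) for v in ordered[:m]) / m


def average_gap(population, b, kappa, trials, seed=0):
    """Mean |population top-fraction average - sample top-fraction average|."""
    rng = random.Random(seed)
    target = top_fraction_mean(population, kappa)
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(population, b)  # sampling without replacement
        total += abs(target - top_fraction_mean(sample, kappa))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    population = [rng.uniform(-1, 1) for _ in range(5000)]
    for b in (50, 200, 800):
        print(b, average_gap(population, b, kappa=0.25, trials=200))
```

Running the script for increasing batch sizes $b$ gives a feel for how the deviation shrinks as $\kappa b$ grows, which is the dependence the lemma quantifies.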
D.6 A Uniform Convergence Bound for the $\ell^{\max}_{\text{prec}@\kappa}(\cdot)$ Surrogate

Having proved a generalization bound for the $\ell^{\text{avg}}_{\text{prec}@\kappa}(\cdot)$ surrogate, we note that similar techniques, which involve partitioning the candidate label space into labels that have a fixed true positive rate $\beta$ and arguing uniform convergence for each partition, can be used to prove a generalization bound for the $\ell^{\max}_{\text{prec}@\kappa}(\cdot)$ surrogate as well. We postpone the details of the argument to a later version of the paper.


E Proof of Theorem 15

Theorem 15. Let $w$ be the model returned by the algorithm when executed on a stream with $T$ batches of length $b$. Then with probability at least $1 - \delta$, for any $w^* \in W$, we have





+ o 



Proof. The proof of this theorem closely follows that of Theorems 7 and 8 in Kar et al. [2014]. More specifically, Theorem 6 from Kar et al. [2014] ensures that any convex loss function demonstrating uniform convergence would ensure a result of the kind we are trying to prove. Since the analysis in Appendix D confirms that $\ell^{\text{avg}}_{\text{prec}@\kappa}(\cdot)$ exhibits uniform convergence, the proof follows. □
F Additional Empirical Results

[Figure 4 panels: (a) KDD08, (b) Covtype, (c) Cod-RNA. Legend: SVMPerf, 1PMB, Perceptron@k-avg, Perceptron@k-max, SGD@k-avg, SGD@k-max.]

Figure 4: A comparison of the proposed perceptron and SGD based methods with baseline methods (SVMPerf and 1PMB) on prec@0.25 maximization tasks.